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ABSTRACT 


This  paper  examines  iterative  and  direct  methods  for  solving  Poisson's 
equation  with  regard  to  their  adaptation  to  ILLIAC  IV.   SOR,  SLOR,  ADI,  and  FACR 
are  programmed  in  GLYPNIR.   Detailed  suggestions  on  ASK  code  for  these  methods 
are  also  supplied. 

FACR,  Fourier  Analysis  and  Cyclic  Reduction,  is  the  fastest  method  on 
rectangular  meshes.   SOR,  Successive  Over  Relaxation,  seems  to  be  the  most 
promising  for  nonrect angular  meshes.   The  methods  are  between  thirty  and  forty- 
five  times  faster  on  ILLIAC  IV  than  on  a  serial  machine  with  speed  equal  to  one 
of  the  ILLIAC  IV  PEs  (Processing  Elements). 
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1.   POISSON'S  EQUATION  IN  MATRIX  FORM 

We  "will  consider  Poisson's  equation  with  Dirichlet  boundary  condi- 
tions defined  on  a  rectangular  region.   This  equation  can  be  written  as 

t(    >  =   #U(X^)   +  iu(X^)  (1) 

bY2  OX2 

where  the  value  of  U  is  given  at  the  boundaries  X  =  0,  Y  =  0,  X  =  (m-l)h. 


:■: 


and  Y  =  (n-l)h  .  We  denote  U((i-l)h   (j-l)h )  by  U     t((i-l)h  ,  (j-l)h) 
y  x      y     j->  j        x      y 

by  i|r.  .  and  error  of  order  p  by  0(p).   By  dividing  the  region  into  mesh 

points  such  that  the  distance  between  mesh  points  in  the  vertical  direction 

is  h  and  in  the  horizontal  direction  is  h  we  get  Figure  l.a.   The  region 
y  x 

in  Figure  l.a  can  be  rotated  90  clockwise  to  obtain  the  mesh  U  in  Figure 

l.b  where  U.  .  is  the  mesh  point  in  the  i —  row  and  the  j —  column. 
1  j3 


By  use  of  Taylor's  Theorem  we  obtain: 

2  3  k      ! 

b  >2TT       h3  .  3TT       h  ^k„ 

au/.  .%   xa  U/.  .v   xe  U/.  .n   xa  u  ,.  ,x 

1+1,3    i,o    x  9X        ^  ^  5-   dx3  <+.  ^Y 

2  3  U  i 

^TT  h    ^2TT  h    ^3TT  h    .UTT 

tt     tt    v  oU/.  ^   x6U/.  .  v   x  a_U/.  .  >   x  a  U/.  .  \ 

2  3  4   . 

2  3  k 


/ 

< 

y 

0,n- 

1 

l,n- 

1 

2,n- 

1 

m-3,n- 

•  1 

m-2,n.-l    m-l,n-l 

< 

0,n- 

2 

• 

l,n- 

2 

•  i 

2,n- 

.2 

1 

1^  m-3,n- 

•2 

• 

m-2,n-2     m-l,n-2 

7        ii          ' 

, 

0,n- 

3 

• 

l,n.- 

3 

•  i 

2,n- 

3 

m-3,n- 

•3 

• 

m-2,n-3     m-l,n-3 
n 

i 
i 

t 

,o,2' 

• 

1,2 

•  > 

2,2 

l#  m"3,2 

• 

i 
1 

m-2,2       m-1,2 

0,1 

• 
0 

1,1 

#1 

2,1 

j     m-3,1 
i,  m-3,0 

• 

m-2,1       m-1,1 
m-2,0  ^    m-1,0  ^ 

T 

h 

iy     - 

0,0 

* 

1,0 

2,0 

* 

j 

S 

l<   h 

J| 

X 


a)  The  rectangular  region  with  the  mesh  points. 
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b)  The  mesh  points  with  separation  of  boundary  and  interior  mesh  points. 


Figure  1.   The  Mesh 


52U 

Thus 


,  (i,j)  =  ~   (U.  .  .  +  U.  .  .  -  2U.  .)  +  0(h)  (2) 

,2   1,0+1   1,3-1    1,3     y 


8Y~         h 

y 


A  ^l 

and  — ^  - 

SX         h' 
x 


&•*>  ■  72  (Ui+l,o  +  Ui-l,j  "  2Ul,d>  +  °(hx>  (3) 


Substituting  (2)  and  (3)  into  (l)  we  obtain 

—      (U.  .,,  +  U.  •  ,  -  2U.  .)  +  ~  (U.  .  .  +  U.  -  .  -  2U.  ,)  »  \|f.  .  •   (if) 
h2   v  1,0+1    1,0-1     1,3    h2   l+l,  J    1-1,3     1,3'    1,3 

y  x 

This  is  the  equation  for  any  interior  mesh  point  U.  ..   The  values  of  the 

1,3 

mesh  points  on  the  boundary  are  given,  and  require  no  equations.   Let  us 

take  all  the  equations  for  the  interior  mesh  points  and  place  the  constant 

terms  on  the  right  hand  side  of  the  equations.   For  example,  the  equation 

for  IL  .  where  3  <  j  <  n-2  will  be 
2,0       J  _  ^  _ 

y  xx 


The  equation  for  Up  p  will  be 

J>  (U2,3  -  2U2,2>  +  J>  <U3,2  "  2U2,2)  ~  *2,2  "  k  \s   "  J>  U2,l 

y  X  ■         X  y 


Combining  the  constant  terms  into  one  term,  M.  .,  for  each  U.  .  we  obtain 

'      1,3'  1,3 

M.     .    =  *.     .   -  ~   (U.     .   _      5.    _   „    +  U.     .    .    8.    ,    _)   -  ■—■   (U.    .     .8.    ,„ 
1,3  1,3  2        i,3  +  l       0+l,n  1,3-1     3-1,1         ^        1+1,3      i+l,m 

y  x 

4-  U.    _    .   8.    _    ,)  (5) 

1-1,3      1-1,1  v 

.  c  (   0  if  k  M 

where    \,i  -   difklr 


Consider  the  internal  mesh  points  as  a  vector  U_  in  which- the  order  of 
the  mesh  points  is  as  follows: 

1)  The  first  interior  point  in  the  first  row  becomes  the  first 
element  of  the  vector. 

2)  All  interior  points  of  row  i  are  placed  in  the  vector  in  the 
same  order  they  appear  in  row  i. 

3)  The  (n-l) —  element  of  the  i —  row  is  followed  by  the  first 
interior  point  of  the  following  row,  where  2  <;  i  <  m-2. 


The  transpose  of  U-r  equals 


(U2^2,  U^3,  . ..,  U2,n-i'  U3,2'  U3,3'  ' 


•  •  > 


u 


,n-l,'       m-1,2  m-1,3' 


U_ 


m. 


■l,a-l)  . 


Using  this  order  we  assemble  the  interior  mesh  point  equations  and  consider 
them  as  the  matrix  equation  AU  =  M  where  M  is  the  vector  of  constant  terms 
and  A  is  a  matrix  made  up  of  the  coefficients  of  the  mesh  points. 

The  matrix  A  is  a  (m-2)(n-2)  by  (m-2)(n-2)  block  tridiagonal  matrix 
of  the  following  form: 


A  = 


C    B 
C 


(6) 


where  C  =  al  _ , 
n-2 


+1 

h2 


and  B  is  a  (n-2)  by  (n-2)  tridiagonal  matrix  of  the  form 


B  = 


"^ 


e     d 

d     e     d 


\     N 


\     \     \ 

\     \      V 

\   \    v 

d      e     d 
d     e 


(6.1) 


i 

J 


where  d  =  —  and  e  =  -  (2a  +  2d). 


h 


y 


There  are  two  general  approaches  used  in  solving  this  system  of  linear 
equations,  direct  methods  and  iterative  methods.*  In  this  study  we  have 
limited  our  discussion  to  three  popular  iterative  methods  and  one  direct 
method:   the  successive  over-relaxation  method  (SOR),  the  successive  line 
over-relaxation  method  (SLOR),  the  alternating  direction  implicit  method 
(ADI),  and  Hockney's  direct  method  (FACR).    In  the  discussions  of  FACR  three 
types  of  boundary  conditions  will  be  examined:   periodic,  Neumann  and  Dirichlet. 
The  discussion  of  iterative  methods  considers  only  Dirichlet  boundary  conditions, 


*  No  matter  which  method  is  used  in  solving  the  equations  the  lower  bound 
on  the  error  is  0  (max(h2,  h2)).   This  is  due  to  the  error  terms  in  (2) 


and  (3) 


y  x' 


2.   IMPLEMENTATION  OF  ITERATIVE  METHODS  ON  ILLIAC  IV 
2.1   Introduction  to  SLOR  and  SOR 

Iterative  methods  have  one  thing  in  common.  They  each  begin  with 
an  initial  guess  to  the  solution  and  improve  upon  the  guess  with  each 
iteration.   Thus  the  following  notation  is  useful:   The  value  of  U  after 
the  £ —  iteration  is  denoted  by  Ul   .   U    is  the  initial  guess  and 

if  the  method  converges  then  Lim  (Uj.   -  U      )  =  0. 


Matrix  A  can  be  divided  into  three  matrices  such  that  A  =  D  -  E  -  F 
where  D  is  block  diagonal,  E  is  strictly  lower  triangular  and  F  is  strictly 
upper  triangular.   Thus  AU  =  M  can  be  written  as  DIL.  -  (E  +  F)  U  =  M. 

From  this  we  get  the  iterative  method: 

U<i+l)  =  D"1  (E  +  F)  u[£)      +  D"1  M  (7) 

The  matrix  equation  can  also  be  grouped  to  obtain 

yU+1)   =  (D.E)-1  FUU)  +  (D-E)"1  M  (8) 

The  asymptotic  convergence  rate  of   (8)   is  twice  as   fast   as    (7)    [Todd, 
Page   391  J.      This   is   due  to  the   fact   that  the  matrix  has   Young's  property  A 

[Wachspress,    Page  102].       (8)    is   a  successive  method  while    (7)    is   a 

** 

simultaneous  method. 

By  defining  J  =  -  (D  -  u>E)   and  H  =  -  (wF  +   (l-w)D)    for  u  ?  0  we  get 

U)  0) 

u  JU-j.  =   wHU     +   coM  from  AU     =  M  =    (J-H)U   . 


*      Improvement    is  measured  by   a  reduction   in  the  error  vector. 

**     Successive  methods   use  new  information   as   soon  as    it    is   available 
while  simultaneous  methods   use  old  values  for  the  entire   iteration. 


We   can  write  ojJU     =  wHU     +  ooM  as 


;D-goE)Ui('',"1)    =    (u)F  +    (l-w)D)   U^    +  com 


(9) 


Note  that  (8)  and  (9)  are  equivalent  if  u  =  1.   When  0  <  w  <  2  and  A  is 
positive  definite,  (9)  converges.   If  1<  to  «.2,  then  (9)  is  referred  to 
as  an  over-relaxation  method  [Forsythe  and  Wasow,  Page  26l ] .   With 

different  values  of  to,  the  convergence  rate  of  (9)  is  altered.   For  a 

2 
single  relaxation  parameter,  it  can  be  shown  that  oj  =  -    — 7   , 

b   1  4/1-7(8) 
where  £  =  D   (E+F),  the  Jacobi  matrix,  and  p(8)  is  the  spectral  radius 

of  6,  gives  the  fastest  rate  of  convergence  [Young,  Page  169 ] • 

There  are  two  common  ways  in  which  equations  (7  -  9)  are  implemented  • 
The  first  gives  point  iterative  methods  by  letting  D  equal  the  diagonal  of 
A.   Thus  (7)  becomes  Jacobi 's  method,  (8)  becomes  the  formula  for  succes- 
sive point  relaxation  iterative  method  (SR)  and  (9)  becomes  the  successive 
point  over  relaxation  iterative  method  (SOR). 


If  we  let 


D  = 


n 


where  B  is  defined  in  (6.1)  then  we  will  get  line  iterative  methods 
(7)  becomes  the  simultaneous  line  iterative  method,  (8)  becomes  the 


successive  line  relaxation  iterative  method  (SLR)  and  (9)  becomes  the 
successive  line  over-relaxation  method  (SLOR)  [Varga,  Page  1991 • 

To  give  the  reader  a  better  understanding  of  the  way  these  methods 

work,  we  write  the  individual  equations  for  the  interior  mesh  points 

U.  . ,  where  3  <  i  <  m-2  and  3  <  j  <  n-2  for  each  method.   First  we  de- 
1,  0 

fine  h,  v,  and  g  as  follows: 

x2  v2  U2      V2 

h  h  h     h 

h  =  p-^ — p-  ,         v  =  ^ p-  ,       and  g  =  -£— ^p- 

2(*i  +  Yi)  2(ii  +   li  )  2(li  +  li  ) 

v  x    y  x    y  v  x    y 


Jacobi  's  method  (simultaneous  point  iterative  method) 

u^t1'  .T(n.W  ,  +n^  n)  +  h(uW  . +uW  .,.  g„. 
i,d       1,0+1   1,0-1     1+1,  a   1-1,0      1^ 


Simultaneous   line  iterative  method 

U?i+1)   =     v   (U.'t1*      +  U<it1»)+  h   (l!1'    .   +  TlW    .)   -     g  M.     .    . 

1,0  1,0+1         1,0-1  1+1,0       1-1,0  1,0 


Successive  point  relaxation  method   (SR) 

i,0         1,0+1      i,j-l       i+l,0    i-l,J       i>0 


(10) 


*  Note  we  have  excluded  the  first  row,  last  row,  first  column,  and  the 
last  column  of  the  interior  mesh  points  to  avoid  boundary  conditions. 
These  points  will  be  discussed  in  another  section. 


Successive  line  relaxation  method  (SLR) 

(Ui)    .  T  (UU+D    +    ^1),  +  h  (UU)     +  „(*!)>  .  (11. 

1,0  1,0+1  1,0-1  1+1,0         1-1,0  1,0 


Successive  point  over-relaxation   (SOR) 
i,0  1,0+1        1,0-1  1+1,0  1-1,0  1,0 

+    ^      "if]"  (13) 


Successive   line  over- relaxation   (SLOR) 

TT(i+l)  /TTU+D        tt(^+1)\  v,    fTTU)  TT^1^ 

U.     .        =     v   (U.     .   ;   +   U.     .    '     +  u   h     U:    '     .     +   U.    ,     . )    -  oo    g  M.     . 

i,0  1,0+1  i,0-l  i+l,0  1-1,0  i,0 

+  (l.u)    (UU).    -        v        (u[£).    .   +   UU?    _)).  (13) 

1,0  1,0+1        1,0-1 


2 . 2     Introduction  to  API 

Let  us   look  at  the   equation  we  are   solving  once  again, 

gutgri  +  ^yi  ^  Y  (     }  (1) 

ax2  or2 

as  before  we  use  Taylor'  s   Theorem  and  we  get 

2 

Ml   (i,0)    =  ^o      (U.     .    n   +   U.     .    -    -   2U.     .)  +    0(h2)  (2) 

sy2  h2         1,3-1         1,3+1  i,j;  y; 

y 


and 

2 


LU(1;.)      i      (  +  n^       .  ,  +  ^  (3) 


5X  h 

x 
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Thus  when  we  form  the  matrix  equation  AU_  =  M  where  A  is  the  matrix  (6), 
we  are  considering  a  horizontal  part  of  the  equation  and  a  vertical  part 
of  the  equation.  We  could  divide  A  into  two  matrices  H  and  V  such  that 
H  +  V  =  A  where  V  is  the  matrix  containing  the  coefficients  of  the  vertical 
equations  (2).   Both  H  and  V  can  be  placed  in  tridiagonal  form  with  the  ap- 
propriate permutation  matrix. 

We  can  take  the  equation 

AU  =  HU  +  VU  =  M  and  obtain 


(H  -  ^jrOU-j.  =  M  -  (V  +  w^l)^     and 

(V  -  u^DUj   =  M  -   (H  +  o^DUj   . 

* 

Thus  we  can  get  the  Peaceman-Rachford  AD I  scheme  [  Varga,   Page  212] . 

Ui(^V2)   =   (h_^hI)-1[m.(v  +  ^hI)Ui(.)]  <!*.!) 

U^30    =    (V  -  u)ivl)_1    [M  -    (H  +a)ivl)U^+1/2)]  (14.2) 

The  parameters,  w.  and  u>  «  are  defined  in  [Wachspress,  Page  192]. 

It  will  help  to  have  the  corresponding  equations  for  the  individual 
mesh  points,  U.  .,  where  3  <  i  <  n-2  and  3  <  0  <  m-2. 

X 
X 


*  ADI  requires  that  A  be  positive  definite.   This  can  be  obtained  by  using 
-AU  =  -M  or  by  using  -w?v  and  -w.„  instead  of  ol„  and  u  „.  We  use  the 

latter  technique. 


11 

„(*!)  .  -^i__   [M.  .  -  ^  (U^V2)  +  u^l/2))  .  (     .  _|)  uti.1/2) 

y 


Examining  (15-1)  and  (15.2)  we  note  a  strong  resemblance  to  simultaneous 
column  relaxation  followed  by  simultaneous  row  relaxation. 

2 . 3  Implementation  of  SOR  on  ILLIAC  IV 

2.3.1  A  Parallel  Processor's  Effect  on  the  Algorithm 

A  parallel  machine  allows  operations  to  be  performed  simultaneously 
which  are  done  in  separate  time  intervals  on  a  conventional  machine.   This 
is  referred  to  as  overlap.   The  amount  of  overlap  is  dependent  on  the  way 
in  which  the  algorithm  is  programmed.   The  SOR  algorithm  supplies  an  il- 
lustration of  how  the  flow  of  an  algorithm  is  changed  when  programmed  to 
maximize  overlap  on  a  parallel  machine. 

Let  us  examine  the  equation  for  SOR  closer: 

for  3  <  i  <  m-2  and  3  <  j  <  n-2.   Note  the  interior  points  which  are  adja- 
cent to  one  or  more  boundary  points  are  not  included.   Also  note  that  for 

the  above  mesh  points  M.  .  -  Y.  .•   The  equation  for  U    would  be 

i  >0  i>  0  2,2 

if]  - u  v  it I +  - h  itl  - » *  v +  cm )  „(i)  (16) 


where 


Thus 


2,2        *2,2        h2        Ul,2  2      U2,l    ' 

X  hy 


M  =   g  ¥  -vU  -  h   TJ 

2,2        &   Y2,2        V   U2,l        n   Ul,2    ' 
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This  enables  us  to  rewrite  (l6^  as 


To  calculate  U    using  (17)  takes  two  more  additions  than  using  (16) 
2  >2 

but  M    is  needed  to  use  (16).   On  a  serial  machine,  one  would  use 
2?2 

equation  (16)    to  improve  U   .   The  choice  is  not  so  straightforward  on 

a  parallel  machine  if  when  U    is  calculated  in  one  processing  element 

2,2 

(PE),  U.  .  is  calculated  in  another  PE  where  3  <  i  <m-2  and  (or) 

3  <  j  <  n-2.   In  this  case  to  calculate  U    using  (.16)  we  would  have  to 

turn  the  appropriate  PE  off  for  two  additions.   By  using  (l7)>  we  would 

not  have  to  turn  off  PEs  for  those  additions  and  would  also  save  the 

calculation  of  M   .   A  similar  argument  can  be  given  for  all  points 
2,2 

adjacent  to  boundary  points.   Thus  the  choice  of  the  equation  t©  evaluate  a 
point  adjacent  to  the  boundary  in  one  PE  depends  on  what  is  being  done 
in  the  other  PEs. 


2.3.2  Parallelisms  in  SOR 

Examining  equation  12,  we  note  that  the  values  of  U.  .  .  and  U.  .  1. 

1-ljtJ  1j0~-'- 

are  needed  to  evaluate  U.  .   .   Now  consider  the  following  mesh: 


1,J 

0 

0 

0 

0 

0 

0 

0 

1 

2 

3 

k 

0 

0 

2 

3 

k 

5 

0 

0 

3 

k 

5 

6 

0 

0 

k 

5 

6 

7 

0 

0 

0 

in  r\  a  -r 

0 

0 

int-      1 

0 

0 

TTU+D 

Whenever  U.  .  is  a  boundary  point  U.  .  =  U.  .   for  all  £. 
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0  indicates  a  boundary  point.   The  interior  mesh  points  are  divided  into 

seven  sets  (1,2,3* •••  7)  where  set  i  contains  all  points  labeled  by  i. 

To  improve  any  point  in  set  i,  we  must  first  improve  all  the  points  in 
the  union  of  all  sets  j  such  that  j  <  i. 

In  a  computer  with  h   PEs,  one  iteration  could  be  completed  in  7  time 
steps  where  a  time  step  is  the  amount  of  time  to  improve  a  mesh  point  in 
one  PE  by  improving  one  set  in  each  time  step.   But  the  computer  could  im- 
prove 28  points  in  7  time  steps.   To  improve  the  efficiency  of  the  method, 
we  look  at  what  could  be  done  in  each  time  step  assuming  we  had  done 
everything  possible  up  to  that  time  step. 


Time  Step 

1 

2 

3 

k 

5 

6 

7 

8 

9 

10 

11 

What 

i1 

21 

31 

k1 

51 

61 

71 

Can  Be 

Is 

i 

2 
3 

k2 

2 
5 

62 

72 

Done  At 

I3 

23 

33 

U3 

53 

63 

73 

That  Time 

l* 

2k 

3k 
I5 

25 

5U 
35 

I6  I 

Table  1.   Parallelism  of  SOR 

£  th 

i   denotes  the  I —  improvement  of  the  elements  in  set  i. 


In  Table  i  we  see  that  steps  1  to  5  improve  less  than  8  elements  but 
8  elements  are  improved  at  each  time  step  following  step  5.   Thus  if  we 


1U 


have  a  computer  with  o  PEs  we  can  compute  U    in  5  +  2£   time  steps.   If 
we  assume  that  our  initial  guess  is  1  ,  2  ,  3  *  h  ,    5  ,   6  ,  7  ,  instead 
ofl,2,3j^j5)6,7   then  we  can  start  at  step  6  and  do  every 
iteration  in  2  time  steps. 


This  can  be  generalized  for  an  arbitrary  mesh  as  follows:   First  we 
group  all  the  interior  mesh  points  into  two  sets, 

ODD  =   {U.  .    i  +  j  is  odd},    EVEN  =  (U.  .  |  i  +  j  is  even} 

1*3  1* 3 

Time  step  1  improves  all  the  points  in  EVEN.   Time  Step  2  improves 
all  the  points  in  ODD.  We  will  call  this  method  modified  successive  over- 
relaxation  (MSOR). 

Remember  all  the  points  in  ODD  can  be  calculated  using  only  points 
from  EVEN  and  all  the  points  in  EVEN  can  be  calculated  using  only  points 
from  ODD.   Thus  we  can  write  a  two  part  algorithm  for  MSOR  [Young,  Page  271]. 

u^t1'  .«„  (v  (u<«  ,  +  uW  .)  +  h  (u<«  .  +  u«>  ,)  -  gT.  ,) 

1,3      Ei  v   v  i*3+l    1*3-1        i-l, J    i+l, 3       i*3 

+  (lKV  ui,j  (18.1) 

for  all  the  points  in  EVEN 

u!*1'  -  »„  (v  (u^1'  +  u«+\>)  +  h-(u^«  +  u^1')  -gf.J       I 

i,j      Di  v     x,J+l    1*3-1        1*3+1    1*3-1       i*3 


+  d-M)  u :  (18.2) 


for  all  the  points  in  ODD-. 
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In  Young's  discussion  of  MSOR,  the  optimal  w  ,   and  jo  .  for  the  £ 
iteration  are  calculated.  We  will  limit  our  study  to  MSOR  where  u  . 

-  wm  =  V 


Intuitively  one  would  think  MSOR  and  SOR  have  very  close  rates  of  con- 

AL 

vergence.   In  fact,  their  "asymptotic  rates  of  convergence"  are  equivalent. 

If  A  is  a  convergent  n  x  n  complex  matrix,  for  all  £   sufficiently  large,  the 

£  £ 

average  rate  of  convergence  for  £   iterations,  R(A  ),  is  Lim  R(A  )  = 

n 

JO   -»  °° 

-lnp(A)  =   R  (A)  [Varga,  Page  67].   Thus  for  a  sufficiently  large  £,    the 
asymptotic  rate  of  convergence  equals  the  average  rate  of  convergence.   We 
know  if  P  is  a  permutation  matrix  that  p(PAP   )  =  p(A)  [  Birkhof f  and 
Maclane,  Page  2*4-9].   By  multiplying  (9)  "by  P  we  get 

PU(^D  =  P  (D-oE)"1  f>F  +  (1^)D]  P"Wi)  (19) 


The  asymptotic  rate  of  convergence  for  (19)  equals 
-Ln  p(p(D^oE)"1  \uF   +  (l-^  )D}  p_1)  =  -Lnp(  (D^aE)"1  f>F+  (l-w)D})  (20) 

which  equals  the  asymptotic  rate  of  convergence  for  SOR. 

This  not  only  shows  that  SOR  and  MSOR  have  the  same  asymptotic  convergence 
rate,  but  that  SOR  applied  to  any  permutation  of  U  gives  the  same  asymptotic  rate 


The  negative  of  the  logarithm  of  the  spectral  radius  of  a  convergent  matrix 
A  is  the  asymptotic  rate  of  convergence,  R  (A),  for  the  matrix  A  [Varga, 
Page  67]. 
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of  convergence.   Thus  we  can  choose  the  ordering  of  U  which  best  fits  the 
machine  being  used. 


2.3.3  SOR  using  Straight  Storage 

Now  we  shall  examine  the  implementation  of  SOR  on  ILLIAC  IV  [Rudsinski]. 

We  will  compare  two  storage  schemes  and  examine  their  effect  on  machine  effi- 

* 

ciency  and  their  time  per  execution. 

The  first  storage  scheme,  Straight  Storage,  stores  all  the  elements  of 
row  i  before  any  of  the  elements  of  row  j  if  i  <  j .   The  elements  of  each 
row  appear  in  the  order  they  are  found  in  the  mesh.   The  first  element  of 
each  row  is  stored  in  PE  0  and  the  last  element  of  the  row  is  stored  in 
PE  I  where  I  equals  (n-l)  mod  6U ,  n  is  the  number  of  columns.   Figure  2  should 
help  clarify  this  storage  scheme. 

How  can  we  implement  SOR  using  Straight  Storage?  We  observe  that  the 
PEs  which  contain  the  odd  elements  of  mesh  row  I  contain  the  even  elements 
of  mesh  row  I  +  1.   This  enables  us  to  improve  the  odd  elements  of  Rows  I  and 
I  +1  simultaneously  by  using  a  PE  integer  index  which  accesses  an  element 


3£ 

The  timing  method  is  explained  in  Appendix  A.   The  terms  used  in  timing 
are  defined  there  also. 
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Row  U. 


Row  U.  . 
3+1 


Row  U 


3+K-l 


Row  U 


J+K 


PEM  0 


U 


3,1 


U 


3,65 


u. 

3,  n  -I 


U 


3+1,1 


PEM  1 


U 


3,2 


U 


U 


3,n+l-I 


3+1,2 


PEM  I 


uj,ni 


U 


3,65+1 


u. 


3>n 


U 


3+1,1+1 


PEM  63 


u 


3,6i^ 


U3,128 


3+1, 6i+ 


Figure  2.   Straight  Storage  of  Row  j  of  U 


in  row  I  (i+l)  in  the  even  (odd)  PEs  -when  I  is  even-   In  a  like  manner  we 
can  access  all  the  even  elements  of  two  consecutive  rows.   If  we  have  an 
odd  number  of  rows  we  can  improve  the  odd  elements  of  the  last  row  and  the 
even  elements  of  the  first  row  simultaneously  using  a  similar  technique. 

Given  a  mesh  with  m  rows  and  n  columns,  we  use  the  index  described 
above  so  that  v/e  can  improve  two  rows  simultaneously.   Each  pair  of  rows  can 


K  stands  for  the  number  of  rows  of  PEM  required  to  store  n  words  and 
I  =  (n-1)  mod  6k. 
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XL 

be  divided  into  K  groups  where  K  =  [n/6ki      if  n  mod  6k   =  0  otherwise 

K  =  Ln/6UJ  +  1.   Using  Phase  1  we  can  improve  6k   elements  at  a  time 

|_(n-2)/6UJ  times  and  then  use  Phase  2  to  improve  the  remaining  interior 

points.    If  we  use  Phase  1  to  improve  the  first  6k   odd  (even)  interior 

points  of  a  pair  of  lines  then  we  must  take  into  account  that  only  63  of 

them  are  found  in  the  same  pair  of  rows  of  PEM  because  of  the  boundary 

points.   This  means  we  must  use  special  indices  to  reach  the  6Uth  odd 

(even)  interior  point.   This  requires  no  extra  computer  time  because  the 

number  of  times  register  X  is  changed  is  not  affected.   SOR  requires  not 

only  the  point  to  be  improved  but  its  four  neighbors  and  the  appropriate 

value  of  Y. 

For  example:   To  improve  U.  .  we  need  U.  n  .,   U .  .  .,  U   .  ,,  U.  .  , , 

i,0         1-1,0   i+l,  J   i,J+l   1,0-1 

U     and  \|r 
i,0       i,0« 

These  values  will  be  accessed  by  a  combination  of  ACAR  and  Register  X 
indexing.   To  do  the  indexing,  K  will  be  combined  with  the  CU  integer  Cl-equal 
to  the  first  row  of  PEM  containing  mesh  points  to  be  improved  during  this 
execution  -  and  three  PE  integers. 

PI=(1,K,0,K,0,...,0,K),   PR=(K,0,K,0,...,K,0,)  and  PL=(K+1, 1,K,0,K,0, . . . ,K,0) 
if  we  are  improving  the  odd  points; 

PI=(K,1,0,K,0,K,...,K,0),  PR=(0,K,0,K,...,0,K)  and  PL=(l,K+l,0,K,0,K, . . . ,0,K) 
if  we  are  improving  the  even  points; 


*  Ipj  equals  k  where  p  is  a  real  number,  k  is  an  integer,  and  k<  p  <  k+l. 
**  Phase  1  and  Phase  2  are  defined  in  Appendix  D. 
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INITIAL  CONDITIONS 

Register 

Contents 

Register 

Contents 

$  X 

PI 

$  D3 

WI 

$  DO 

H 

$  CO 

CI 

$  Dl 

V 

$  CI 

CM 

$  D2 

W 

$  C2 

CP 

STEP 

ASSIGNMENT 

OPERATIONS  AND  OVERLAP 

TIME  IB 

NO. 

STATEMENTS 

CLOCKS 

1 

$A:  =  U[$X]  (0) 

PE  fetch 

7 

2 

$C3:  =  $D3;  $A:  =  $A*$C3 

CU  fetch 

Real  multiplication 

10 

3 

$S:  =  $A  -  i|r[$X]  (0) 

Real  subtraction 

(PE  fetch  overlapped  by 

multiplication) 

7 

k 

$A:  =  U[$X]  (2) 

(PE  fetch  overlapped  by 

addition) 

0 

5 

$A:  =  $A  +  U[$X]  (1) 

Real  addition  and  PE  fetch 

Ik 

6 

$X:  =  PL 

(PE  fetch  overlapped  by  addition) 

0 

7 

$C3:  =  $D0;  $A:  =  $A*$C3 

CU  fetch 

Real  multiplication 

10 

8 

$R:  =  U[$X]  (0) 

(PE  fetch  overlapped  by 

multiplication) 

0 

9 

$A:  =  $A+  $S 

Real  addition 

7 

10 

$X:  =  PR 

(PE  fetch  overlapped  by  addition) 

0 

11 

$B:  -  RTL  (1,,$R) 

Route 

3 

12 

$R:  =  U[$X]  (0) 

(PE  fetch  partially  overlapped 

by  Route) 

k 

13 

$S:  =  $A 

Load  from  PE  Register 

1 

Ik 

$A:  =  RTR  (1,,$R) 

Route 

3 

15 

$A:  =  $A  +  $B 

Real  addition 

7 

16 

$C3:  =  $D1;  $A:  -  $A*$C3 

(CU  fetch  overlapped  by  addition) 

Real  multiplication 

9 

17 

$X:  =  PI 

(PE  fetch  overlapped  by 

multiplication) 

0 

18 

$A:  =  $A  +  $S 

Real  addition 

7 

19 

$C3:  =  $D2j  $A:  =  $A*$C3 

(CU  fetch  overlapped  by  addition) 

Real  multiplication 

9 

20 

U[$X]  (0):  =  $A 

PE  store 

7 

TOTAL  TIME 

105 

Table  2.   Timing  of  an  Execution  of  Phase  1  SOR  using  Straight  Storage 
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Let  W  =  u>g,  V  =  v/g,  H  =  h/g,  and  Wl  =  — —  where  g,  h,  and  v  are  defined 

in  1.2. 

Now  (IT)  can  be  written  in  GLYPNIR  as  follows: 

CP:  =  C  +  K; 

LOOP  CI:  =  C,  K  +  K,  CK  DO  BEGIN 

CM:    =   CI  -   K; 

CP:    =  CI  +  K; 

(21) 
U   [PI   +    CI]:    =  W*(V*(RTR(1,,U[FR+CI]  )+  RTL(l,,U[PL+Cl] ) )      +   H*(U[PI+CM] 

+  u[pi+cp])  -\|r[Pi+ci]  +    wi*u[pi+ci]   ); 

END; 

By  placing  CI  in  ACARO,  CM  in  ACAR1,  CP  in  ACAR2,  H  in  $D0,  V  in  $D1,  W 

in  $D2,  Wl  in  $D3  and  PI  in  $X  (21)  could  be  translated  into  ASK  as  outlined 

in  Table  2. 

2.3.1*  SOR  using  Odd-Even  Storage 

Now  we  will  look  at  another  storage  scheme  to  see  if  we  can  improve 
the  data  access  time.   Given  an  m  by  n  mesh  divide  each  row  into  k  +  1  segments 
such  that  k  segments  are  128  mesh  points  long  and  the  last  segment  is  £   points 
long,  where  0  <  -#  <  128.   Use  two  adjacent  lines  of  PEM  to  store  each  segment. 
Store  all  the  red  (black)  points    of  the  segment  in  the  first  (second)  line 
of  PEM.  The  segments  appear  in  the  same  relative  order  as  they  appeared  in 
Straight  Storage.   This  storage  scheme  will  be  called  Odd -Even  Storage. 
For  an  example  see  Figure  3« 


Appendix  A  explains  the  notation  and  timing  method  used  In  Table  k. 

A  point  that  appears  in  an  odd(even)  numbered  column  is  considered  a 
red  (black)  point. 
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If  &  <  6k,   then  Straight  Storage  requires  m  brl   lines  of  PEM  while  Odd- 
Even  Storage  required  m  (  MM  +l)  lines  of  PEM.   If  &   >  6k   then  both  Straight 


Storage  and  Odd-Even  Storage  require  m 


Vk 


lines  of  PEM. 


PEM  0 


PEM  1 


PEM  I 


Row  U 


J 


Row  U 


j  +  l 


Row  U 


j+K-2 


Row  U 


j+K-1 


'- 

1 

1  J,1 

\i,2 

:       i 

u. 

3,n- 

t 

•i+il 

U. 

i 
i 

■i+2J 
! 

i 

.u 


j,n-i+3 


U 


J,n-^+U 


U 


d,n-l 


J>n 


PEM  63 

• 

\l27 

Uj,128 

• 

i 

1 

• 
i     : 
1 

Figure  3«   Odd-Even  Storage  Data  Allocation  of  Row  j  of  U  Where  n  is  Even. 


r pi  =  k  where  k  is  an  integer  such  that  p  <  k  <  p  +  1. 
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INITIAL  CONDITIONS 

Register 

Contents 

Register 

Contents 

$  X 

PI 

$  D3 

Wl 

$  DO 

H 

$  CO 

CI 

$  Dl 

V 

$  CI 

CM 

$  D2 

W 

$  C2 

CP 

STEP 
NO. 

ASSIGNMENT 
STATEMENTS 

OPERATIONS  AND  OVERLAP 

TIME  IK 
CLOCKS 

1 
2 

$A:  =  U[$X]  (0) 
$C3:  =  $D3;  $A:  • 

=  $A*$C3 

PE  fetch 
CU  fetch 
Real  multiplication 

7 
10 

3 

$S:  =  $A  -  \|r[$X] 

(o) 

(PE  fetch  overlapped  with  mult.) 
Real  subtraction 

7 

k 

$C3:  =  $C0  +  1 

(CU  operations  overlapped) 

0 

5 

$R:  =  RTL  (1,,U(3)) 

(PE  fetch  overlapped  by  addition) 
Route 

3 

6 

$A:  =  $R  +  U[$X](3) 

(PE  fetch  partially  overlapped  by 
Route)  Real  addition 

11 

7 

$C3:  =  $D1;  $A: 

=  $A*$C3 

(CU  fetch  overlapped  by  addition) 
Real  multiplication 

9 

8 

$R:  -  U[$X](1) 

(PE  fetch  overlapped  by  multiplication' 

0 

9 

$S:  =  $A  +  $S 

Real  addition 

7 

10 

$A:  =  $U[$X]  (2) 

+  $R 

(PE  fetch  overlapped  by  addition) 
Real  addition 

7 

11 

$C3:  =  $D0;  $A:  = 

=  $A*$C3 

(CU  fetch  overlapped  by  addition) 
Real  multiplication 

9 

12 

$A:  =  $A  +  $S 

Real  addition 

7 

13 

$C3:  =  $D2;  $A: 

=  $A*$C3 

(CU  fetch  overlapped  by  addition) 
Real  multiplication 

9 

Ik 

U[$X]  (0):  =  $A 

PE  store 

7 

TOTAL  TIME 

93 

Table  3-   Timing  of  an  Execution  of  Phase  1  SOR  using  Odd-Even  Storage 
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Phase  1  of  the  implementation  of  SOR  using  Odd-Even  Storage  requires 
only  one  PE  index  and  one  route.   We  can  access  the  data  using  PI  =  (2,0,0,0. . ,0) 
and  CI  equal  the  first  row  of  PEM  containing  mesh  points  to  be  improved  during 
this  execution.   We  need  two  assignment  statements:   (22.1)  for  the  red 
points,  and  (22.2)  for  the  black  points. 

Using  K,  W,  V,  H  and  Wl  are  defined  in  2.2.3,  SOR  using  Odd-Even  Storage 
can  be  written  in  GLYPNIR  as  follows : 

LOOP  CI:  =  C,  K,  CK  DO  BEGIN 

CM:  =  CI  -  K; 

CP:  =  CI  +  K; 

U[PI  +  CI]:    -  W*(V*(RTR(1,,   U[CI+l])   +  U[PI  +  CI  +  1]) 
+   H*(U[PI   +   CM]    +  U[PI  +   CP])    -   <F[PI   +   CI]    +  WI*U[PI   +   CI]);         (22.1) 

U[CI]:    =  W*(V*(RTL(l,,U[PI  +  CI   -   l])   +  U[CI   -  l]) 

+   H*(U[CM]    +  U[CP])-   V[CI]    +  W1*V[CI]);  (22.2) 

END; 

Table    3  suggests  how   (22.1)   could  be   coded  in  ASK.       (22.2)   requires 
different  code  but  the  logic   is  similar.        The  code   for  SOR  using  Odd -Even 
Storage   is  lCftfo  faster  than  the   code   for   SOR  using   Straight  Storage.      This  is 
due  to  a  savings   in  memory  fetching  and  the  elimination  of  one  route. 

2.3.5      SOR  Machine  Efficiency 

If  a  mesh  has  m  rows    and  32    (6h)    columns   and  straight    (odd-even)    storage 
is   used,   machine   efficiency  can  be   doubled  by  placing  rows  1  to  —  in  PE's 
0  to   31  and  rows  —  +  1  to  m  in  PE's    31  to   63.      This   improved  stora.ce  scheme 
requires  that   rows  75-  and  r  +  1  be  handled  as   special   cases.      The   idea 
behind  this  technique  can  be  used  on  any  mesh  which   doesn't  have  a  multiple 


2k 


of  6k    (128)  columns  when  straight  (odd-even)  storage  is  used.   Implementation 
of  such  techniques  increases  machine  efficiency  and  decreases  storage  require- 
ments per  mesh,  but  also  increases  the  complexity  of  the  program. 

2.k      Implementation  of  SLOR  on  ILLIAC  IV 
2.U.1  Parallelism  in  SLOR 

In  2.3.2  the  scheme  for  SOR  was  altered  to  maximize  parallelism  on 
ILLIAC  IV.   In  a  similar  manner,  we  will  modify  the  scheme  for  SLOR  to  maxi- 
mize the  number  of  columns  which  can  be  improved  simultaneously.   In  2.1+.2 
the  need  for  blocks  of  128  columns  to  perform  column  relaxation  with  optimal 
efficiency  on  ILLIAC  IV  will  be  explained.   This  leaves  no  restrictions  on 
the  number  of  rows  other  than  the  limits  of  available  core.   Analogous  to 
the  SOR  scheme,  note  that  column  i  can  be  improved  for  the  second  time 
after  column  i  +  1  is  improved  for  the  first  time.   By  taking  advantage 
of  this  fact  we  can  improve  half  of  the  columns  simultaneously  after  a 
finite  number  of  iterations.   Thus  once  we  reach  that  state  we  can  improve 
the  entire  mesh  in  two  steps   if  we  have  enough  PEs .   In  SOR  we  improve  the 
odd  (even)  points  and  then  the  even  (odd)  points.   In  SLOR  we  must  improve 
the  odd  (even)  columns  and  then  the  even  (odd)  columns.   The  proof  we 
used  in  2.1.2  to  show  that  we  could  improve  the  points  in  SOR  in  any  order 
can  be  extended  to  show  that  the  order  in  which  the  columns  are  improved 
SLOR  does  not  change  the  asymptotic  rate  of  convergence. 

u  .   =  h(u  !   +  u  *  .  +  uvu  '   +  u  :  .  -  wg  V.  .      23.1 

i,J        l+l,  J     i-l, J        i,J+l    i,j-l       i,J 

/       X  /  (£)       /  (£)        (£)    u 

i,j  1+1,  j      1-1,  j 

for  the  odd  columns   and 

u(Ul>   =  htu'^1?      ♦  U<*«b  +  -T  (U<Ui»   ♦  uf1*1')-  »«  *.     •  (23.2) 

1,J  1+1, J  1-1, J  1,J+1  l,j-l;  1,J 

+  (1  -  w     u:       -  h(u:'  .  +  u.  '  .)) 
i,j  1+1,  j       i--i».i 

for  the   even  columns. 
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2.U.2     Storage   for  SLOR 

Nov  we  need  to   find  a  storage  scheme  which   is   efficient   on   ILLIAC  IV. 
If  we   use   Straight   Storage,   we  run  into  the   same   inefficiency  we  did  with 
that   storage   in   implementing  SOR.      We  need  to   use  three  PE  integer   indices 
and  do  two  routes   per  mesh  point.      Furthermore,  the  method  works  on  blocks 
of  256   columns.      Column  J  and  column  J  +  1   cannot  be   improved  simultaneously. 
Thus  we   improve  the  odd   (even)    columns    in  the   set    {j|l  <  J  <   I  +  63}   and 
the   even    (odd)    columns    in  the   set    {j|l  +  128  <   J  <,   I  +  191}.      If  we  use 
Odd-Even  Storage  we  can  eliminate  one   route  and  two  of  the  PE  indices.      We 
also   decrease  the  blocks  to  128  columns   each. 

2.k.3     Improving  a  Line  by  the  Cuthill-Varga  Method 

The  method  of  implementing  SLOR  must    solve   a  tridiagonal  matrix 
equation   in   each  PE.      Thomas'   method   [Todd,    Page  395]    is   commonly  used. 
A  method  by   Cuthill   and  Varga  can  eliminate   some  of  the  work  by  doing  some 
preconditioning  if  the  tridiagonal  matrix  is  positive  definite  and  real 
symmetric.      Even   if  the  mesh   spacings   are  not    constant,  Poisson's   equation 
satisfies  those  conditions    [Cuthill     and  Varga,    Page  2Ul  ] . 

We  will   apply  the  latter  method,   the  Cuthill-Varga  method,    as    follows. 
Because  we   are  performing  successive  column  over-relaxation  we  will   use   a 
permutation  of  A,    PAP"   ,   to   discuss  the  technique.      The   equation  under 
consideration  will  be  -PAP~  PU     =  -PM. 
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-PAP 


-1 


where 


B  = 


B, 

C 

c, 

B,  C 

C,  B,    C 

\           \          \ 

\         \ 

C,    B,    C 

C,    B, 

C 

c, 

B 

— ■ 

- 

e, 

d 

e,    d 

N 

S 

c 

(2k) 


L,Ne,Nd 
d,    ej 


y  \  y       x 


th 


B  and  C   are  m-2  by  m-2  matrices.      Let   U.    and  M.    denote  the   i        column 


of  PU  and  -PM  respectively.      Now  using   (2k)   we  obtain 


BU.    =  M.    -   C(n     n    +  U.       ). 
l  i  l+l  l-l 


(25) 


By  assuming  the  left    side  of   (25)   is   constant   we  are  left  with   a  tridiagonal 
system  of  equations.   Because   B  satisfies  the   requirements   for  the   Cuthill- 
Varga  method,    [Cuthill   and  Varga,    Page  237],   we  can   define   a  diagonal 
matrix 


G  = 


'1 


'm-2  J 


such  that 
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G^BG-1   =   T'T, 


where 


* 


T  = 


1        tn 


1        t 
*■        ^x 

*  X 

1    I 


r3 


The  elements   of  G  and  T  are  calculated  as   follows 

1 
d    A    2 


1 


c     =   eZ;    c.   =    {e  -    (—^—)    }        ,    2   <   J    <  m-2, 
J  j-1 


and 


(26) 


tj    =    °j    Vl 


1   <  J    <  m-3. 


(27) 


By  multiplying 


BU.    =  M.    -   C(U.^    +  U.   _  ) 
l  i  i+l  l-l 


"by  G       we  obtain 


G"1BG"1GU.    =   G  1M.    -   G  1CG_1G(U.^n    +.  U.    ,  )    . 
l  l  i+l  i-l 


Substituting  Y.    for  GU.    gives 

G_1BG_1Y.    =   G-1M.    -   G"1CG~1    (Y.^  +  Y.       ) 
l  l  i+l        i-l 


or 


T  TY  =      G'^A .    -   G  """CG  X    ( Y. ^    +    Y.       ) 
i  l  i+l  i-l 


(28) 


Equation    (28)    is   first   solved  for  TY.    and  then   for  Y.. 

11 

The  Cuthill-Varga  method  employs   a  two   part    algorithm  to  perform  SLOR. 


First  Y.    is    calculated  using 


1  1  1+1  1-1 


(29.1) 


if  i   is   odd  and 


T     equals   the  transpose  of  T. 
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T'ty*.U+1)   =  G-V-   G^CG"1    (y!^1'   +  Y.^1')  (29.2) 

1  1  1+1  1-1  ' 

if  i   is   even. 

* ( &+1 )  ( & ) 

Then  Y.  is    combined  with  Y.        to  perform  the  over-relaxation, 

l  l 

(wi)  .b[;(wd  .t(*)]  +yU)   .  (30) 

l  ill 

After  all  the   iterations   are   completed,   Y.    is   converted  to  U.    using 

G-1Y.    =  U.. 
l  l 

2.U.1+  SLOR  Implementation 

Now  let  us  discuss  more  specifically  how  we  implement  SLOR  to 

XT  T  XT  T 

solve  — r-  +  — —     =   V.      Odd- Even  Storage   issued  for  the  reasons   stated  in 

6X  6Y 

Section  2.U.2.      We  use  the  Cuthill-Varga  method  and  thus   do  the   following 

preconditioning:    calculate   c,   1  £  i   <  m-2;   t.,   1   <   i   <,  m-3;   —  ,   1    <  i   <  m-2; 

and  —  ,    1   <   i   <  m-2.      We  use  PEM  storage   for  these  values   so  that  when 

(Vi> 

we  calculate  c.  we  can  do  6h   of  the  square  roots  simultaneously  because  the 

square  root  takes  a  great  deal  of  time  relative  to  other  operations  on 

**  1  —1 

ILLIAC  IV.   We  use  Grabone  '  to  access  the  c,  t.,  —  and  r— .   We 

1        i        i  f-u         \2 

v°i 
subtract   the      row  boundary   conditions   times  —     to  the   appropriate  row 

h 
—1  x 

of  y  and  then  we  multiply   by  G   .   We  multiply  column  boundary  conditions 
by  G  to  obtain  Y   and  Y  so  that  they  may  be  used  in  equations  (29)  and  (30). 
Now  we  are  ready  to  do  the  iterations.**"*"  Two  phases  are  used.   Phase  1 
improves  6U  columns  simultaneously  and  Phase  2  improves  the  remainder. 
Phase  1  is  used  when  the  number  of  interior  columns  is  greater  than  or 
equal  to  128.   Phase  1  starts  at  the  left  and  works  on  groups  of  128  interior  columns 


*  We  limit  our  program  to  handle  an  even  number  of  columns. 

**  The  GLYPNIR  function  is  used  to  access  a  single  PE  value  and  send  it 
to  all  PEs.   Refer  to  Table  9. 

***  The  reader  should  be  familiar  with  the  material  in  Appendix  A. 
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moving  to  the   right   until  the  number  of  remaining  interior  columns    is  less  than 

128.      Since  there  are   elements   of  63   interior  odd  columns  on  the   first 

line  of  PEM  containing  interior  points  we  use  PI  and  CI   as   defined  in 

2. 3.h.      m-2   rows   of  temporary   storage,    STORE,  are  used.    We  use  two 

similar  codes   for  Phase  1:      one   for  the  odd  columns   and  one   for  the   even 

columns.      Phase   2   can  use  the  same   code   as   Phase  1    if  the  mode   is   changed 

at  the  appropriate  time. 

We  will  limit   our   discussion  to  the  code   for  the  odd  columns,   Phase 
1.      The  rest   of  the  code   can  be   found  in  Appendix  D. 


STORE[0]:=  *[PI+Cl]-GRABONE(DISB[0],0)*(RTR(l,,u[CI+l] )+U[PI+CI+l]); 
LOOP  CJ:=   1,1, CM3  DO  BEGIN  C:=  CJ. [l6:42] ; 
CI:    =  CI+K; 

STORE   [CJ]:=  ^[PI+Cl]-GRABONE(DISB[C],CJ)*(RTR(l, ,U[CI+l] )+ 

U[PI+CI+1])-GRAB0NE(E[C],CJ)*ST0RE[CJ-1]; 

END; 


(31) 


LOOP  MC:=1,1,CM3   DO  BEGIN 
CJ :  =CM3-MC ; 
C:=CJ.[l6:U2]; 

ST0RE[CJ]:=ST0RE[CJ]-GRAB0NE(E[C],CJ)*ST0RE[CJ+1];  (32) 

END; 


*  -1  -1 

DISB  contains  —  ,  ¥  contains  G  M  and  E  contains  t.. 

<Vi>2 
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irfiHAt  conditions'  m  mum  mmmm 


Register 


$A 
$X 


Contents 


ST0RELCJ-1J 
PI 


STEP  NO. 


ASSIGNMENT  STATEMENTS 


$B:  =  GRAB0NE(E[CJ,CU) 

$A:  =  $A*$B 

$S:  =  \|r  [$X](0)-$A 

$R:  =  RTR(1„U(2)) 

$A:  =  U[$X](2)+$R 


=  GRABONE(DISB[C],CU) 
=  $A^$R 
=  $S-$A 


|  $A 

STORE (l):  =  $A 


Register 


$C0 
$C1 
$C2 


Contents 


CI 
CJ 
CI+1 


OPERATIONS  AND  OVERLAP 


TIME  IN 
CLOCKS 


Load  10 

Real  multiplication  9 

(PE  fetch  overlapped  by  mult.)     j  7 

Real  subtraction 

(PE  fetch  overlapped  by  subtract.)   3 

Route 

(PE  fetch  partially  overlapped 

by  route) 

Real  addition  11 

Load  10 

Real  multiplication  9 

Real  subtraction  7 

PE  store  '  7 


aubtotai 


IT 


INITIAL  CONDITIONS  FOR  BACKWARD  ELIMINATION 


Register 


Contents 


Register 


"$C0" 


Contents 


"|a~ 


ST0RELCJ+1J 


CJ 


STEP  NO. 


ASSIGNMENT  STATEMENTS 


OPERATIONS  AND  OVERLAP 


ITIME  IN 
CLOCKS 


$B 
$A 
$A 


=  GRABONE(E[C],CJ) 

=   $A*$B 

=  ST0RE(0)-$A 


STORE(O):    =  $A 


Load  10 

Real  multiplication  !    9 

i 
(PE  fetch  overlapped  by  mult. )  ;    7 

Real  subtraction  ■ 
PE   store  7 


Subtotal 
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INITIAL  CONDITIONS  FOR  OVER-RELAXATION 


Register 


Contents 


Register 


Contents 


$X 

$co 


PI 


$C1 
$C2 


cu 
w 


STEP  NO. 


ASSIGNMENT  STATEMENTS 


$R:  =  UL$XJ(0) 
$A:  =  ST0RE(l)-$R 

$A:  =  $C2*$A 


U[$X](0):  = 


OPERATIONS  AND  OVERLAP 


.TIME  IN 
CLOCKS 


PE  fetch 

PE  fetch 

Real  subtraction 

(CU  fetch  overlapped  by  sub. 

Real  multiplication 

Real  addition 

PE  store 


14 

9 
7 


Subtotal 


W 


TOTAL  TIME 


150 


Table  h.      Timing  of  an  Execution  of  Phase  1  SLOR 
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LOOP  CJ:=0,1,CM3  DO  BEGIN 

ST:=U[PI+CI]; 

U[PI+Cl]:=W*(STORE[CJ]-U[PI+Cl])+U[PI+Cl]; 

IF  ABS(ST-U[PI+CI])  GRT  BOUND  THEN  B:  =1;  %   CHECKING  THE  ERROR  BOUND  (33) 

CI:=CI+K; 

END; 

SLOR  has  three  primary  sections:   (31)  does  the  forward  elimination,  (32) 
does  the  backward  elimination,  and  (33)  performs  the  over-relaxation. 
In  Table  k   we  suggest  how  to  code  these  sections  efficiently  in  ASK  and 
we  give  a  time  estimate  for  that  code. 

2,5   Implementation  of  API  on  ILLIAC  IV 

ADI  improves  all  columns  simultaneously  and  then  improves  all  rows 
simultaneously.   Thus, when  we  do  the  column  (row)  improvement  we  want 
the  values  of  the  boundary  columns  (rows), but  the  boundary  rows  (columns) 
could  have  been  added  to  vector  \|/  in  preconditioning.   One  solution  is  to 
include  both  the  row  and  the  column  boundary  values  in  each  iteration. 
Another  solution  is  to  add  them  both  in  when  preconditioning  and  then  set 
the  boundary  values  to  zero  so  they  will  not  affect  the  interior  points 
when  used  in  the  iteration.   If  we  need  the  values  of  the  boundary  after 
the  problem  has  converged  then  we  must  store  them  before  performing  any 
of  the  iterations  and  replace  them  after  all  the  iterations  are  completed. 
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2.5.1  Thomas'  Method  on  ILLIAC  IV 

As  in  SLOR  we  have  sets  of  tridiagonal  matrices  to  solve.  We  used 
the  Cuthill-Varga  method  in  SLOR.   This  method  does  preconditioning  to 
cut  down  on  the  computational  time  (see  Section  2.U.3).   AD I  changes  the 
diagonal  at  every  iteration.   Thus  the  preconditioning  would  have  to  be 
done  at  each  interation.   Thomas'  Method  [Todd,  page  395]  has  the  same 
problem  but  the  preconditioning  requires  less  computer  time  and  the  total 
computer  time  per  iteration  is  smaller.   Thus  we  will  use  Thomas'   Method 
in  solving  the  tridiagonal  matrix  problems.   Instead  of  implementing  the 
method  in  a  straight  forward  manner,  we  use  the  fact  that  the  horizontal 
coefficients  and  the  vertical  coefficients  are  constant. 

The  matrix  problem*  we  want  to  solve  is  of  the  form 
aT1  +  b  T2  =  D1 


bT 


±_±   +  aT.  +  bT.+1  =  D.   (i  =  2,    3,  ...,  n-3), 


bT    +  aT  0  =  D 
n-3     n-2    n-2 

where 

2  1 

a  =  -("" 2"  +  w   ),   b  =  —   for  row  relaxation 

*  y 

and  ,2  \  ,  1 

a  -  -  \~~~2  +  w.J,        b  =  —       for   column  relaxation 

h  h 

x  x 

Thomas'  method  is  as  follows 

ei'i>e±=^rr-     ci-2, 3, .... -3),  m 

1-1 
D-,        D.    bq. 

ql-"T'   qi  °  a-b"e.  "     (i-2f  3,  ...,  n-2),  (35) 

l-l 

Tn-2  =  ^-2*  Ti  =  qi  "  ei  Ti+1   (i  =  n'3>    n~k>      ">    1}'  (36) 


* 


We  are  considering  an m  by  n  mesh,  U. 
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If  we  let  c  =  -  we  can  rewrite  the  equation  (35)  as  follows: 


b 


q.  =  e±   c  Dv      q_±  =  e±    (c  D.  -  qj,_1)  (i  =  2,3,  •-,   n-2), 

and  e    =    b .   This  reduces  the  number  of  divisions  which  are  re- 

n-2   a-b  e   0 
n-3 

latively  time  consuming  on  ILLIAC  IV. 


We  have  to  calculate  the  e.'s  and  store  them.   For  each  row  (column), 
they  are  the  same.   Calculating  the  e.'s  is  a  serial  process.   The  only 
way  we  could  overlap  is  to  calculate  the  set  of  e.'s  for  different  itera- 
tions simultaneously.   The  problem  with  this  solution  is  that  it  requires 
extra  storage.   If  we  calculate  all  the  e.'s  needed  for  32  iterations  we 
use  max  (m-2,  n-2)  rows  of  storage.   If  storage  is  available,  then  this 
is  a  feasible  method;  but  if  we  must  use  disk  I/O,  the  computational 
time  saved  will  be  overcome  by  increased  i/O  time. 


2.5.2  Skewed  Storage 

The  next  problem  we  must  overcome  is  how  to  access  the  rows  on 
one  sweep  and  then  the  columns  on  the  next.   Skewed  storage  was  designed 
just  for  that  purpose.   We  can  place  a  m  by  n  mesh  in  skewed  Storage  by 
performing  the  following  algorithm: 

1)  Calculate  the  number  of  rows  of  PEM  required  to  store 
n  points.   Set  K  equal  to  that  number. 

2)  Place  the  m  by  n  mesh  in  straight  storage. 

3)  Route  each  of  the  K  rows  of  PEM  containing  elements  of 
mesh  row  i.   Let  the  distance  of  the  route  be  i-1. 
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Figure  4 .   A  96  by  96  Mesh  in  Skewed  Storage 
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AD I  is  a  simultaneous  method,  (i.e.  only  values  from  the  previous 
iteration  are  used,   see  (l5.l)  and  (15.2)).   On  ILLIAC  IVwe  can  simul- 
taneously improve  up  to  6h   rows  (columns).   Thus  if  the  dimensions  of 
the  mesh  are  less  than  67  we  can  do  all  the  rows  (columns)  in  one  time 
step.   If  one  of  the  dimensions  is  greater  than  66  then  we  need  to  use 
more  than  one  time  step  for  row  or  column  relaxation.   In  this  case  we 
will  use  two  phases  to  complete  an  iteration.   Anm  by  n  mesh  will  be 


divided  using  R  = 


m-2 


n-2 


jj—     and  J  =  j-rr— j  •   Phase  1  row  (column)  relaxation 

will  improve  R  (j)  groups  of  6h   consecutive  rows  (columns).   Then  Phase  2 
will  improve  the  remaining  interior  rows  (columns).   Phase  1  starts  with 

row  (column)  2  and  works  toward  row  (column)  m-1  (n-l).   When  Phase  1  im- 

th 
proves  the  I   group  where  1  <  I  <  R  (j)  and  k  is  the  first  row  (column) 

of  group  I,  the  old  values  of  row  (column)  k-1  are  needed.   Thus  to  im- 
prove group  I  Phase  1  goes  through  the  following  steps: 

1)  The  values  of  row  (column)  k-1  are  placed  in  the 
temporary  storage,  NEW. 

2)  The  values  in  the  temporary  storage,  OLD  are  placed 
in  the  PEM  storage  for  row  (column)  k-1. 

3)  The  values  or  row  (column)  k+63  are  placed  in  the 
temporary  storage,  OLD. 

h)     Rows  (columns)  k  through  k+63  are  improved. 

5)  The  values  in  NEW  are  placed  in  row  (column)  k-1. 

When  Phase  1  is  improving  the  first  group  steps  1,  2  and  5  can  be 
omitted.   Phase  2  needs  to  do  steps  1,  2,   k   and  5* 
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Now  we  will  discuss  the  type  of  indexing  used  in  accessing  the  mesh  points 
required  to  perform  Phase  I  column  relaxation  using   skewed  storage.   We  ac- 

cess  the  set  { (^ .  .,  U.  .  )|k  <  j  <  k  +  63}   by  using  a  permutation  of 
1*  3         1*  <J 

the  PE  integer,  (1,0,0,  •••,0).   PI  will  stand  for  that  permutation. 
PI  will  be  used  as  a  PE  index  and  CI  will  be  the  CU  index.   PI  must  be 
routed  one  PE  to  the  right  to  access  the  set  {U.  ,  . |k  <  j  <  k+63} .   PI 
must  be  routed  one  PE  to  the  left  to  access  {U.    . |k  <  j  <  k+63}  •   To 

access  the  set  (U.  .   |k  <  j  <   k+63}  we  must  add  RTR(l,,Pl)  to  PI.   The 

I*  J+  -L 

set  {U.  .   |k  <  j  <  k+63}  can  be  accessed  using  only  CI.   After  being 
ij  3~  ■*- 

accessed  the  four  neighbors  of  U.  .  must  be  routed  to  the  PE  containing 

U.  .  before  U.  .  can  be  improved.   The  following  GLYPNIR  code  uses  this 
1*3         1*3 

method  of  data  access  to  perform  Phase  1,  Step  k   of  column  relaxation: 
LOOP  C:=l,l,N-2  DO  BEGIN  CC:=  C.  [16:1*2];  CI:=CI+K; 
U[PI+Cl]:=(AI*(\|r  [PI+Cl]-BETA*(RTL(l,  ,U[PI+RPI+Cl]  )+RTR(l,  ,U[Cl]  ) ) 
+CA*U[PI+Cl])-RTR(l,,U[LPI+CI-K]))*GRABONE(B[C-l],CC);  (37)** 

LPI:=PI;PI:=RPI;RPI:=RTR(l,,RPl);END; 

LOOP  IC:=0,l,N-3  DO  BEGIN  C:=N-2-IC; 

CC:+C.  [16:1+2];  CI:=CI-K; 

U[PI+CI] :=U[PI+CI]-GRAB0NE(B[C-1],CC)*RTL(1,,U[RPI+CI+K]);  (38) 

RPI:=PI;   PI:=RTL(l,,Pl);END; 


*  The  variable,  k  equals  the  first  column  of  the  group  being  improved. 

2  2 

**  AI=-h   ;   BETA=  l/h     ;W  equals  the  relaxation  parameter  for 
y  x 

this  iteration;  CA=2*BETA+W. 
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The  code  is  divided  into  two  sections:   Section  (37)  does  Thomas' 
forward  elimination.   Section  (38)  does  Thomas'  backward  elimination; 
Table  7  suggests  how  this  code  could  be  translated  into  ASK  and  how 
much  time  the  ASK  code  would  take. 

By  replacing  PI  with  KPEN=RTR(l, ,PENK)  where  PENK=(0,K,2k, . . . ,63K) 
and  K  is  the  number  of  lines  of  PEM  required  to  store  a  row  of  the 
mesh,  we  would  have  Phase  1  row  relaxation.* 

A  complete  program  for  a  m  by  n  mesh  where  n,  m  <  64  is  found  in 
Appendix  E. 


*  In  row  relaxation  a  PE  index  must  be  used  to  access  (U.  .  , Ik  <  j  <  k+633 

1,0-1'  -   - 
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Table  5.  Timing  of  an  Execution  of  Phase  1  ADI 
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2.6  Summary  of  the  Results  for  Iterative  Methods 

Now  that  we  have  examined  implementation  of  the  various  methods  we 

can  make  some  recommendations  concerning  the  use  of  the  methods.   Table  6 

contains  a  numerical  study  of  convergence  rates  of  the  method  applied  to 

Poisson's  equation  where  h  =  h  =  -^  on  different  mesh  sizes.   The 

x    y      j_p 

error  bound  was  0.0001. 
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a)  co  is  the  relaxation  parameter 

b)  I  is  the  theoretical  number  of  iterations 

c)  I  is  the  actual  number  of  iterations* 

a 

d)  2n  is  the  number  of  relaxation  parameters  used  in  ADI 
Table  6.   A  Numerical  Study  of  the  Methods  under  Discussion 

We  can  see  from  Table  6  that  ADI  is  consistently  the  fastest**  then 
comes  SL0R,  with  S0R  being  slowest.   This  conforms  with  theoretical  results. 
In  fact,  in  a  square  mesh  where  h  =  h  ,  SL0R  converges  J2     times  faster  than 
S0R  while  ADI  converges  3  times  faster  than  S0R.   B  is  a  monotonically 
increasing  function  of  the  number  of  mesh  points  [Todd,  Pages  396-398]. 


** 


Remember  ADI  improves  the  points  twice  in  each  iteration. 

The  speed  of  the  algorithms  is  compared  with  respect  to  the  number 
of  iterations . 
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To  compare  the  time  required  to  perform  one  of  the  above  methods  on 
ILLIAC  IV  to  that  required  to  perform  the  method  on  a  specific  machine  we 
need  to  calculate  the  clocks  per  point  per  iteration  and  compare  it  with  the 
clocks  per  point  per  iteration  on  ILLIAC  IV.  We  have  shown  that  SOR,  SLOR, 
and  ADI  can  be  programmed  to  enable  ILLIAC  IV  to  improve  64  points  simulta- 
neously for  certain  mesh  sizes.   By  dividing  the  clocks  required  to  improve 
64  points  simultaneously  by  6k   we  obtain  the  clocks  per  point  per  iteration 
on  ILLIAC  IV.   Table  7  contains  these  values  and  other  pertinent  information. 


ADI 

SLOR 

SOR 

Number  of  operations  per 
point  per  iteration 

Mult. 

10 

k 

k 

Add. 

10 

6 

5 

Time  in  clocks  per  64 
points  per  iteration 

Arithmetic 

160 

7S 

71 

Data  Access 

164 

72 

22 

Total  clocks  per  point  per  iteration 

5.06 

2.34 

1.45 

Optimum  mesh  size** 

641  by  6k L 

1281  by  L 

641  by  2L 

Rows  of  PEM  required  for  temporary 

Max 
(n-2,m-2) 

m-2  +,.  fo-2) 

e 

storage 

h       6k 

Table  7«   Summary  of  Time  and  Storage  Requirements  for  the  Iterative  Methods 
on  a  m  by  n  Mesh. 


Most  of  the  data  access  time  on  ILLIAC  IV  could  be  eliminated  if  it  was  a 
serial  machine.   Let  Machine  A  be  a  serial  machine  with  a  processor  similar  in 
speed  to  a  PE   but  with  the  ability  to  perform  ADI,   SOR,  and  SLOR  with  neg— 
ligible  data  access  time.     From  Table  7  we  can  see  that  Machine  A  would  take 
71/1.45,  about  48,  times  longer  to  perform  SOR; than  ILLIAC  IV  would  take. 
Machine  A  would  take  about  32  times  longer  than  ILLIAC  IV  to  perform  ADI  or 
SLOR.    .  ...... 


*     For  maximum  efficiency  on  ILLIAC  IV,  one  dimension  of  the  mesh  must  be 
a  multiple  of  64  for  SOR  and  SLOR  while  both  dimensions  must  be 
multiples  of  64  for  ADI. 

**    I  and  L  are  positive  integers 

***   A  single  PE  has  approximately  twice  the  arithmetic  speed  of  a  CDC  6600. 

****  The  clocks  per  point  per  iteration  on  Machine  A  equal  the  arithmetic 
time  in  clocks  per  64  points  per  iteration  on  ILLIAC  IV. 
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The  times  in  Table  7  are  assuming  the  inner  loops  of  the  programs  are 
written  in  ASK.   If  the  programs  were  written  completely  in  GLYPNIR  the 
times  would  about  double. 

Tables  6  and  7  indicate  that  ADI  is  the  fastest  iterative  algorithm 
on  ILLIAC  IV  for  rectangular  meshes.   The  theory  for  ADI  has  only  been 
developed  for  special  cases  [Young,  Page  555 ] .   If  we  have  a  commutative 
case,  HV  =  VH,  then  ADI  will  be  effective.   The  commutative  space  requires 
a  rectangular  region  and  coefficients  of  the  differential  equation  which 
are  sufficiently  regular  [Young,  Page  538],   In  some  non commutative  cases, 
numerical  experiments  have  shown  ADI  to  converge  rapidly.  This  is  hot 
always  true.   In  some  noncommutative  cases  ADI  fails  to  converge  for  cer- 
tain parameters  [Young,  Page  5^6],   More  work  needs  to  be  done  on  non- 
commutative  cases  before  ADI  can  be  used  freely  on  them. 

ADI  performs  row  relaxation  followed  by  column  relaxation.   Thus  to 
efficiently  perform  ADI  on  a  m  by  n  mesh  on  ILLIAC  IV,  both  m  and  n  must 
be  divisible  by  6k.      For  example  if  we  have  a  l6  by  6k   mesh  when  we  per- 
form row  relaxation,  ILLIAC  IV  will  be  working  at  25%   machine  efficiency 
while  for  column  relaxation  ILLIAC  IV  would  be  working  at  100$  machine 
efficiency.    If  one  had  four  l6  by  6k   meshes  to  solve,  ILLIAC  IV  could 
solve  the  four  meshes  simultaneously  and  thus  brings  the  machine  effi- 
ciency up  to  100$. 

SLOR  converges  in  fewer  iterations  than  SOR  and  can  be  programmed  to 
take  about  the  same  amount  of  time  per  iteration  as  SOR  on  most  serial 
machines.   On  ILLIAC  IV  SLOR  requires  data  from  one  PE  broadcast  to  all 
the  PEs  while  SOR  does  not.   This  is  the  major  factor  in  causing  SLOR  to 
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take   about   1.5  times  longer  per   iteration  than  SOR,    see  Table   7.      Thus 
SOR  is   a  faster  method  on  ILLIAC   IV  unless    it   takes   at   least   1.5  times 
more   iterations   to   converge.      This    is   rarely  the  case.      SLOR  requires 
considerably  more  temporary   storage  than  SOR.      Finally  SLOR  is   a  block 
iterative  method  and  thus    cannot  be   as   easily  reformulated  as   SOR  for 
efficient   performance  on   ILLIAC   IV. 

As   shown   in   Section  2.3  of  this    study,    SOR  has    a  few  constraints 
on  how  it  must  be   implemented.      This   enables   one  to  program  it    effi- 
ciently on   ILLIAC   IV  for  most   mesh  geometries.      SOR  seems  to  be  the  most 
promising  of  the   iterative  and  direct   methods   examined  by  the   author 
for  non-rectangular  meshes. 

In  this    study,   we   indicated  that   the  order  in  which  the   elements 
are   improved  by  SOR  does  not    affect   the   asymptotic   rate  of  convergence. 
This    can  be  misunderstood.      The  number  of  iterations   required  to  obtain 
a  specified  error  bound  is    dependent   on  the  order  in  which  the   elements 
are   improved.      For  specific   initial   conditions,    one  ordering  might   give 
a  faster   initial   rate  of  convergence  than   another.      In  some   applications 
of  Poisson's   equation  the  actual  number  of  iterations   required  for  con- 
vergence  is    substantially   smaller  than  theoretically  expected  because   a 
good  initial   guess    is    supplied.      In  these  cases   the  ordering  has   a  greater 
affect   on  the   computer  time  required  to   get   an   acceptable  solution.      ADI 
and  SLOR  require   a  considerable  amount   of  preconditioning  before  the 
iterative  process    can  begin.      Thus   if  the   initial   guess   is   good  enough, 
SOR  will  take  the  least  amount    of  computer  time  on  any  mesh. 
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This   study  examined  straight   and  odd-even  storage   for  SOR.      The 
use  of  odd-even   storage   improved  the   speed  of  SOR  by  10%.      In  most 
problems,    straight   storage  is   used.      Thus,    for  a  valid  comparison  of 
the  two   storage   schemes,   the  time  to   convert   straight    storage  to  odd- 
even   storage  must  be   considered.      This  takes   about  the   same  amount   of 
time   as   one  SOR  iteration.        Thus  the  program  must   require   at   least 
20  iterations  to  justify   converting  to  odd-even  storage  and  then  back 
to  straight   storage. 
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3.      IMPLEMENTATION   OF  DIRECT  METHODS  ON   ILLIAC   IV 

3.1      Introduction  to  Hockney's   Direct   Method    (FACR) 

Direct  methods    for  solution  of  a  restricted  class   of  Poisson's 

equation  have  "been   developed  which  are  faster  than  any  iterative  method 

developed  to   date    [Dorr,   Pages   258-259 ]•      To  the  author's  knowledge, 

Hockney's   Fourier  Analysis /Cyclic  Reduction    (FACR)    is  the   fastest 

direct  method  on   serial   machines    [Hockney,    Page  159].      In  this   paper 

we  will  examine  FACR  and  modify  it   so  that    it   can  be  programmed 

efficiently  on   ILLIAC   IV.      FACR  solves  the   "five-point"   difference 

formula  on   a  rectangular  mesh;    namely, 

u     i    *    -2U      .    +  U   ._    .  U     .    _    -2U  +  U       ^ 

s-l,t  s,t  s+l,t  s,t-l  s,t  s,t+l  w  ,      v 

v2  v2  ~     s,t  Uyj 

h  h  ' 

x  y 

N 
where  the  number  of  points  being  computed  in  one   direction   is   2     -1   for 

N 
Dirichlet's  boundary   conditions,    2      for  periodic  boundary   conditions   and 

N 
2     +1    for  Neumann's  boundary   conditions.      These  three  boundary  conditions 

are  permitted  in  the  x  and  y   direction,    giving  nine  possible   combinations 

of  boundary   conditions. 

A  2     by  2     mesh  produces   a  linear  system  with  2  unknowns.      Let 

M  =   2     and  N  =  2    .      FACR  solves  this   system  using  the   following  five  step 
algorithm: 

1.  Given  V   compute   ¥      for  the  even  columns    and  overwrite  ¥  with  y     on 
the   even   columns. 

2.  Using  ¥      compute  ¥      and  overwrite  Y     with  y    . 

s  s  s       s 

3.  Using  V  solve  for  U  and  overwrite  ¥  with  U  . 

h.      Using  U  compute  U  on  the  even  columns  and  overwrite  U  with  U  on  the 
even  columns. 
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5.      Using  the  values   of  U  on  the  even  columns   solve    (39)    for  the  values 
of  U  on  the  odd  columns   and  overwrite  ¥  with  U  on  the  odd  columns. 

The   superscripts   are   for  Dirichlet's  "boundary  conditions   and  are 
defined  as   follows: 

VI  =     Vl/  _ 

Ti,j     *i,j+l 


M-l 
2     V   ....  .      uik 


s  c.        p        m.  .      irxft- 

fi,J   =  M     ^   K,l    Sln  TT 

,J  k*l     k,J  M 


Odd/even  reduction  divides  the  problem  into  two  parts;  first  to  solve  a 
linear  system  concerning  even  columns  and  secondly  to  solve  for  the 


values  on  the  odd  columns.   The  even  columns  form  a  linear  system  with  M^- 


i  N 

unknowns  which  is  solved  by  steps  2,  3,  and  h.      Step  5  involves  the  -^ 

linear  systems  with  M  unknowns  that  give  the  values  for  the  odd  columns. 


Fourier  analysis  decouples  the  even  columns  into  M  tridiagonal  linear 

N  s 

systems  with  —  unknowns.   The  unknowns  are  the  Fourier  harmonics,  U  ,  of  U. 

Recursive  cyclic  reduction  solves  the  linear  systems  for  the  Fourier 

g 
harmonics  U  . 


s 
Fourier  synthesis  converts  the  Fourier  harmonics  U  obtained  by  the 

previous  step  to  U  giving  the  values  of  U  on  the  even  columns. 


The  values  of  U  on  the  even  columns  are  used  to  calculate  the  values 

N 
of  U  on  the  odd  columns.   This  involves  solving  —  tridiagonal  linear 

systems  with  M  unknowns  [Hockney,  Pages  1U8-153]. 


H6 


Refer  to  Appendix  B  for  a  more  detailed  explanation.   Note  that  all  five 
steps  can  "be  performed  using  one  mesh  containing  the  "boundary  conditions  and 

it    S     S 

y   initially.   This  is  overwritten  by  f  ,  T  ,  U  and  finally  the  solution  U. 

We  will  discuss  the  method  in  three  parts,  steps  1  and  5,  step  3, 
and  steps  2  and  h.      Suggestions  will  he  made  for  the  application  of 
each  part.   Finally  we  will  present  a  program  in  GLYPNIR  similar  to 
FACR  and  make  suggestions  on  how  it  can  he  programmed  for  ASK. 


3.2  Odd/Even  Reduction  and  Odd  Column  Solution 

Odd/even  reduction  is  used  to  cut  down  on  the  number  of  columns 
where  Fourier  analysis  and  synthesis  is  applied.   Consider  the  three 
neighboring  equations: 

i  ut-2  +  BUt-i  +  :r  ut  ■  Vi 

h  h 

y  y 

7Vx+    BUt+TVi  -^  (k0) 

h  h 

y  y 

~   U   +    BU,     +  -|  \J  =    y 

h2  t       t+1   h2   t+2     *t+1 

y  y 

where  U  is  the  t   column  of  the  mesh  and  t  is  even.   By  multiplying 

2 
the  middle  even  line  equation  by  -Bh  and  adding  we  obtain 

»7 


h2  Ut-2  + 

y 


2  \  _l 

h2  -  B2  h2  I  Ut  +  h2  U    =  I    -  Bh2  T  +  f    =  ,  *  (1,1) 

\  y      /     y  J 


Note   equation    {hi)    contains  only   even   columns   of  U,   that    is   U     0,   U 
and  Up.      Thus   after   {hi)    is    solved  the  algorithm  must    solve   for  the 
values   of  U  on  the  odd  columns. 


hi 


To  apply  odd/ even  reduction  one  needs  an  odd  number  of  mesh  points 
for  Dirichlet  or  Neumann  boundary  conditions  and  an  even  number  of  mesh 
points  for  periodic  boundary  conditions. 

Odd/even  reduction  halves  the  number  of  columns  that  require  Fourier 
analysis  and  synthesis.   The  odd  columns  are  solved  as  tridiagonal 
matrix  problems,  step  5.    Cyclic  reduction  is  used  to  solve  for  the 
odd  columns  of  U.   This  has  the  potential  of  saving  computational  time. 
Hockney  showed  that  one  level  of  odd/ even  reduction  reduces  the  total 
number  of  operations  per  point  on  a  serial  machine  from  3^  to  2k   on  a 
128  x  128  mesh  [Hockney,  Page  l6l]. 

Let  k  be  the  number  of  interior  columns  in  the  mesh.   On  ILLIAC  IV 
Fourier  analysis  and  synthesis  can  be  applied  to  6k   of  these  columns 
simultaneously.   Thus  without  odd/even  reduction  a  Fourier  algorithm 

■jt-  (times.    If  odd/even  reduction  is  used  a  Fourier 
algorithm  must  be  applied  iTpo"  I  times. 

Odd/even  reduction  is  of  no  use  if  k  <  6k   because  then  \rr\  =  Rt^o"  J  • 
If  k  >  6k   then  \tt~]  >\T^p'\   an(^  odd/even  reduction  would  reduce  the 
number  of  times  Fourier  analysis  and  synthesis  must  be  applied.   The 
best  case  is  where  f  rr-J  =  2  [y^a   J  in  which  case  the  number  of  Fourier 
analysis  and  synthesis  is  cut  in  half. 

Straight  storage  is  the  standard  storage  scheme  used  by  most  pro- 
grams. If  odd/even  reduction  is  used,  maximum  machine  efficiency  is 
obtained  when  an  even  column  is  contained  in  each  PE  for  the  odd/even 


fP]  equals  the  integer  L  where  L  -  1  <  P  <  L. 
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Figure    5.      Storage   Schemes  Used  on  Rows  k  and  5   of  a 
96  by  96  Mesh   if  Odd/Even  Reduction  is  Used 
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reduction  step,  the  Fourier  analysis  step  and  the  Fourier  synthesis  step; 
CRED  can  access  a  different  row  in  each  PE;  and  during  the  solution  of 
the  odd  columns  an  odd  column  can  be  accessed  by  each  PE.   The  following 
storage  changes  will  fulfill  all  these  conditions  assuming  each  row  of 
the  mesh  requires  an  even  number  of  rows  of  PEM. 

1.  We  assume  the  main  program  supplies  U  in  straight  storage. 

2.  Before  odd/even  reduction,  the  even  rows  of  PEM  containing 
U  are  routed  one  PU  to  the  right  of  the  odd  rows. 

3.  Before  CRED,  U  is  placed  in  skewed  storage. 

h.      Before  Fourier  synthesis,  U  is  returned  to  the  storage  used 
for  odd/ even  reduction. 

5.   The  mesh  U  is  placed  in  straight  storage  at  the  end  of  the 
subroutine. 

See  Figure  5  for  clarification  of  the  above  storage  schemes.   If  odd/ 
even  reduction  is  not  used  steps  2  and  h   of  the  storage  changes  need  not 
be  executed.   The  use  of  odd/even  reduction  requires  the  extra  storage 
changes  because  steps  one  through  four  of  FACR  are  applied  to  even 
columns  while  step  five  solves  the  odd  columns.   Without  the  use  of 
odd/even  reduction  steps  2,  3,  and  h   in  Section  3.1  would  be  applied  to 
all  the  columns. 

3.3  CRED 

In   step   3,    section   3.1  we  have   a  tridiagonal   system  of  equations  to 
solve.      The   derivation  of  the  coefficients   for  Dirichlet's  boundary 

conditions    is   contained  in  Appendix  B.      By  multiplying  equation    (B-8)    in 

2  th 

Appendix  B  by  h       the   diagonal   coefficient   for  the  k        row  becomes 
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k  k          2                                       h                 P 

h              «  .  ,.h        h .      x               ,           h              h  ' 

X]t..|2^_co.ai  .8  (-^*z^\ 

x  x       "x                                      "x              "x 


,hH     li        N  .  li  ti  i 

vh         h        '  h  hi 


Hockney   solves  these  tridiagonal   systems   using  cyclic   reduction  recursively, 
Given 

It-*  +  \  lt-2 +  It  -\zK,t-i 

uk,t +  \  uk>t+s  +  uk,t^  =  \2  yk,t+i 

where  t    is   even. 

By  multiplying  the  middle   equation  by  -A^and   adding  we  obtain 

US        .    +    (2  -  A2)    US        +  US        ,    =  h  2    (V  +  V  -  A^  s      ) 

k,t-l+        v  k;     k,t  k,t+U  y      v   k,t-2         k,t+2  k,t' 

(1*3) 

Now  the  process    is   repeated  on  the   equations   until  we   are  left  with   just 
one   equation.      Then  the  process    is   reversed   [Hockney,    Page  150].      This 
method  uses   LogpN  extra  words   of  storage,    ^N  -  2LogpN  additions,    2N 
multiplications   and  Log  H  divisions    for  an  N  variable   system.        It    also 
requires  that    N  equal  2     -  1   if  the   system  has   Dirichlet   conditions,    2 
if  the   system  has  periodic   conditions,    and  2     +  1   if  the   system  has 
Neumann   conditions. 

For  Dirichlet  boundary   conditions   Thomas'   method  allows  one  to 
solve   a  tridiagonal   system  of  N  variables  where  there   are  no   restrictions 
on  N,    (See  Section   2.5.1)-      In  the   case  where  the   diagonal   elements    are 
equal   and  all  the  other  non-zero   elements    equal   1,  we  calculate 


* 

N  is  the  number  of  interior  even  columns  in  the  mesh. 
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■'%,].'  X„-e,    .   n  ft>ri-2.3.....H.  (1.1, ) 

k  '  k     k,i-l 


'k,l        Xv    '      k,i        A^-e 


"k.1  =  "k.l  ek,l    ••   \,i  '   {%r  «k,i-l>   ei     for  *  "  2'3 N-    (1,5) 


In  "  Vr  li=  <k.i"  ek,i  li+i    for  ±  =  N-1'1*-2 x-       <W) 


This   method  uses   N  words  of  extra  storage,    N  divisions,    2N  multipli- 
cations,   and  3N  additions.      On   ILLIAC   IV  a  division  requires   as  much  time 
as   six  multiplications.  Thus  the  calculations   of  the  e.'s   takes  most   of 
the  time  required  for  Thomas'   method.      Since  the  e.'s   are  independent   of 
U  and  ¥  they   could  be  calculated  in  preconditioning  and  used  repeatedly 
by  Thomas'    method.      Since   each  row  has   a  different    \     we  need  a  different 

K. 

set  of  e,  .  's  for  each  row  of  U.   Thus  all  the  en  .  's  would  require  half 
k,i  k,i 

as  much  storage  as  U  if  odd/even  reduction  is  applied.   Thomas'  method 
can  be  applied  to  an  arbitrary  M.   This  not  only  allows  the  program  to  be 
more  general  but  in  some  cases  can  save  storage  in  placing  U  and  m   in  PEM. 
If  a  problem  with  Dirichlet  or  Neumann  boundary  conditions  required  the 
accuracy  obtained  by  6h   by  6k   meshes,  applying  cyclic  reduction  would 
require  the  meshes  to  be  65  by  65  while  Thomas'  method  would  accept  65 
by  6h   meshes.   In  straight  storage  a  65  by  65  mesh  requires  130  lines  of 

PEM  while  a  65  bv  6U  mesh  reauires  65  lines  of  PPM.   If  th«  boundarv 
conditions  are  periodic,  cyclic  reduction  would  require  a  6k   by  6k   mesh, 
and  would  require  only  6k   lines  of  PEM.   The  extra  storage  required  to 

store  the  e   . 's  for  Thomas'  method  is  too  great  for  some  memory  bound  problems, 
By  changing  (k6)   to 

TT  —  TT  \\t  ^  TTS 

Uk,N  "   V      k,N-l   "    \,N  "   \   Uk,N; 

Uk,i   =  \,i+l   "   \  Uk,i+1   "  Uk,i+2        f0r  ±   =  N"2>    N"3'    -—I- 
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we  eliminate  the  need  for  extra  storage.  We  will  refer  to  this  method 
as  modified  Thomas'  method.  It  takes  N  divisions,  2N  multiplications, 
and  4N  additions. 

Table   8   summarizes  the  storage   and  computational  time  of  cyclic 
reduction,    Thomas'   method  and  modified  Thomas  method  where  the   divisions 
are  done   in  preconditioning  for  Thomas'   method  and  cyclic  reduction   and 
the   divisions   are  not   done   in  preconditioning  for  the  modified  Thomas'   method 
The   computational  time   is   calculated  by  multiplying  7,    9   and  56   clocks  by 
the  number  of  additions,   multiplications    and  divisions   respectively  and 
is   for  6h  rows  of  U.      The  storage  is   in  words  not  lines  of  PEM. 


Mesh  Size 
Restrictions 

Extra  Storage 
Requirements 

Time  per  6k 
Rows  in  Clocks 

Thomas'  Method 
with  odd/even 
reduction 

An  odd  number 
of  columns 

Thomas  ' 
Method 

NM/2  words 

39N/2 

Mod.  Thomas' 
Method 

No  words 

102N/2 

Thomas'  Method 
without  odd/even 
reduction 

No 
Restrictions 

Thomas  ' 
Method 

NM  words 

39N 

Mod.  Thomas  ' 
Method 

No  words 

102N 

Cyclic  Reduction 


N  =  21  +  1 

—            ., 

MLog  N  words 

U6N  -  lUlog  N 

Table   8.      Comparison  of  Three  Methods   to  Solve   Tridiagonal  Matrix 

Equations   for  Dirichlet's  or  Neumann's   boundary   conditions, 


Odd/ even  reduction   is  most   helpful  when  f~")  =  2   fj|g~]    where  k  is   the 
number  of  interior  columns    (see  Section  2).      The  time  estimates  of  Table   8 
for  Thomas'    method  with  odd/ even  reduction  are  only  half  the  corresponding 
values   for  Thomas'   method  without  odd/even  reduction.      This   is  because 
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odd/ even  reduction  decreases  the  size  of  the  tridiagonal   system  to  "be 
solved  by   a  factor  of  2. 

For  cyclic  reduction  the  amount   of  extra  storage   in  Table  8  is 
misleading.      If  I   >  6  then  6hu  words  of  extra  storage  are  required  per 
mesh  in  PEM.      If  both  U  and  V  are   in  PEM,then  128M  words   are  required. 
This   is  because  cyclic  reduction   requires   2     +1   columns    for  Dirichlet's 
or  Neumann's  boundary  conditions  while  a  similar  error  bound  could  be 
obtained  using  Thomas '   method  and  2     points. 

For  cyclic   reduction  on  Neumann's  boundary   conditions,    if  I   >  6 
then  an  extra  Fourier  analysis   and  synthesis  must  be  performed  to  solve 
the  last   column.      This  time  should  be  divided  by  the  number  of  rows   and 
added  to  the  time  per  row  for  cyclic  reduction  on  Neumann's  boundary 
conditions. 

If  odd/ even  reduction   is   used  then  the  odd  columns   of  U  are   calculated 
by   solving  tridiagonal   systems.      These   systems   are  already  restricted  in 
size  by  the  FFT  performed  on  the  even  columns.      One   cyclic   reduction  pro- 
gram can  be  written  to  solve   for  U  on  the  odd  columns   for  all  three 
boundary   conditions.      Thus   cyclic   reduction  will  be  used  to   solve   for  U 
on  the  odd  columns  when  odd/even  reduction   is   employed. 

3.^     Fourier  Analysis   and  Synthesis 

Hockney  wrote  one  routine  which  performs   Fourier  analysis  or  synthesis 
for  any  of  the  boundary  conditions    [Hockney,   Page   201].      His   routine   assumes 
that  there   is   plenty  of  temporary  storage  available.      Although  this   is  true 
on  most   large   serial  machines   ILLIAC   IV  has   a  relatively  small   amount   of 
storage  compared  with   its   speed.      Thus   a  method  which   is   as    fast   as 
Hockney' s  Fourier  routine  but   does  not   require  extra  storage  would  be 
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preferred.      Cooley,   the  author  of  FFT,   presents   such  a  method  in 
[Cooley,    Page   320-323].      Cooley  shows  that   it   is  possible  to  use  the 
complex  fast  Fourier  transform  to  obtain  the   sine  series    (required  for 
Dirichlet  boundary  conditions),      cosine  series    (required  for  Neumann 
boundary   conditions)    and  real   series    (required  for  periodic  boundary 
conditions).      Both  Hockney's   and  Cooley 's   methods  place   restrictions 
on  the   interior  mesh  sizes:      2     -1   for  Dirichlet,    2      for  periodic   and 
2+1   for  Neumann. 

Cooley' s   algorithm  for  Dirichlet  boundary  conditions   goes   as 

follows:      let  M  =  2    ,   Y(j )    for  j    =1,    ...,   M-l  be  the   coefficients   of 

M-l 
the  Fourier  sine  series,   b(k)   =     J     Y(j  )   sin  -jjr-  . 

J=l 

Y(0)   =  Y(M)   =  0,    and  Y(j)   =  -  Y(2M-j)   =  -  Y(-j)    for  j   =  0,    .  . . ,   M. 

Define 

X(j)    =   -[Y(2J+1)    -   Y(2j-1)]    +  Y(2j)i   for  j    =   0,    ...,   M/2. 

Thus 

X(0)   =  -[Y(l)   +  Y(-l)]   +  Y(0)   =  -2Y(1)   +  Oi 

and 

X(M/2)   =  -[Y(M+1)    -   Y(M-l)]    +  Y(M)i   =   2Y(M-l)   +   Oi. 
For  j   =1,    ...,    M/2-1, 

X(M/2+j)   =  -  Y(M+2j+l)   +  Y(M+2j-l)   +  Y(M+2j )i  =  Y(M-2j-l)    -  Y(M-2j+l) 
-  Y(M-2j)i. 

Let  X(j)   =  C(J)    and  calculate  A^j)    and  A^)    for  j   =   0,   1,    ...,   M/k 

using 

Vj)   -  C(j)   +  C(M/2+j)    and  Ag(j)  =    [C(j)    -  C(M/2+j)]Wj 

where 

wm  -  cos  Sii  + 1  sn  aa . 

Ihen  calculate  A(j)    and  A(M/lt+j  )    for  j    =0,    1,    ...,   M/!(-l 
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where 

A(j)   =  A1(j)    +   i  A2(j),   A(M/U+j)    =  A1    (j)    -   iAg(j) 
Thus,  for  j    =  0,    1,    ...,    MA, 


Eli 


A(j)    =  -Y(2j+1)    +  Y(2j-1)    +  Y(M-2j-l)    -  Y(M-2j+l)    +  SIN  ^  [Y(2j+l) 
-   Y(2j-1)    +  Y(M-2j-l)    -  Y(M-2j+l)]    +   COS  ^L  [y(2J  )    +  Y(M-2j)] 
+   i  {Y(M-2j)    -   Y(2j)    +  SIN  ^(Y(2j)    +   Y(M-2j)]   +  (U8.l) 

COS  2l±      [Y(2j-1)    -  Y(2j+1)   -   Y(M-2j-l)    +  Y(M-2j+l)]} 

and  for  J    +1,    . . . ,    U/k 

A(M/2-j)    =   Y(2j-1)    -  Y(2J+1)    -  Y(M-2j+l)    +  Y(M-2j-l)    - 

SIN  ^li  [Y(2j+1)    -  Y(2j-1)   +  Y(M-2j-l)   -  Y(M-2J+l)]    - 
COS  2il   [Y(2j)    +  Y(M-2j)]    -   i{Y(M-2j)    -  Y(2j)    - 


SIN  ^-  [Y(2j)    +  Y(M-2j)]   -   COS  ^f-  [Y(2J-l)    -  (1+8.2) 


^-  [Y(2j)    +  Y(M-2j)]    -   COS  ^i 
Y(2j+1)    -  Y(M-2j-l)    +   Y(M-2j+l)]}. 


Now  the   complex  fast   Fourier  transform  is    applied  to  A   giving 
N/2-1 
X(j)   =        I       A(k)W^,    for  j   =  0,   1,    ...,   M/2-1. 
k=0 

For  j   =  1,    3,    5,    ...,   M-3  let 

b(j)  =  X  (ji|— )  imaginary  and 

b(j+l)  =  X  (^p~ )  real;  b(m-l)  =  X(—  -1 )  imaginary. 

Finally  b(j)  for  j  -  1,  ...,  M-l  is  calculated 

fcD(j)  =  [Mj)  -  b(M-J)]  -  [b(j)+b(M-j)]/[2Sin  *£]..  (1+9) 

M-l 
Now  we  have  Ub(j)  =  1+   J  Y(k)  Sin  ^jj-  . 

k=l 

Fourier  analysis  calculates  —  b(j)  and  Fourier  synthesis  calculates  b(j). 
Since  we  are  dealing  with  linear  equations  we  can  multiply  p  in  (l)  and 
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TIME   B 
CLOCKS 

1 

P[$X](3):=$R-$S 

PEM  Store,  Addition 

Ik 

1 

$C3:=$C3-$D2 

CU  Addition  overlapped 

0 

1 

$B:=P[$X](2) 

PE  Fetch  overlapped 

0 

1 

$A:=$Sj$S:=$B 

Two  PE  Register  Transfers 

2 

1 

P[$X](2):=$P+$A 

Addition,   PE  Store 

Ik 

1 

$C2:=$C2+$D2 

CU  Addition  overlapped 

0 

3°(¥>IS 

1 

$R:=P[$X](2) 

PE  Fetch  overlapped 

0 

P[$X](2):=$R+$R 

Addition,   PE  Store 

1U 

$C2:=$DO+$D5-$D2 

CU  Addition  overlapped 

0 

$S:=P[$X-(2) 

PE  Fetch  overlapped 

0 

$C3:=$D5 

CU  Register  Transfer  overlapped 

0 

P[$X](3):,-$S-$S 

Addition,   PE  Store 

Ik 

$C3:=$C3+$D2 

CU  Addition  overlapped 

0 

$R:=P[$X](3)           #ODD3 

PE  Fetch  overlapped 

0 

P[$X](3):=-$R-$R 

Addition,   PE  Store 

14 

TNT 

$CO:=$D3 

CU  register  transfer  overlapped 

0 

^y 

2 

I 

$C3 : =$C3+$D3 ;$C2 : =$C3-$D2 

CU  Addition  overlapped 

0 

2 

$R:=P[$X](3)-$R 

PE  Fetch  overlapped,  Addition 

7 

2 

$D6: =GRABONE (DS [GL] ,N2-IL) 

CU  Load 

10 

2 

$D7 : =GRABONE (DS [GL] ,IL) 

CU  Load 

10 

2 

$S : =$D6*$R 

Mult  ipl i  c at  i on 

9 

2 

$A:=P[$X](2)*$D7 

PE  Fetch  overlapped,  Multiplication 

2 

$S :  =$A-$S            </00DD2 

Addition 

7 

2 

$R:=$R*$D7 

Multiplication 

9 

2 

$A:=P[$X](2)*$D6 

PE  Fetch  overlapped 

9 

2 

$R:=$A+$R           $ODDl 

Addition 

7 

2 

$C3:=$DO+$DU-$CO 

CU  Addition  overlapped 

0 

2 

$CO : =$CO+$D3 ; $C1 : , $C3 -$D3 

CU  Addition  overlapped 

0 

2 

P[$X](2):=P[$X](3)-P[$X](1) 

Two  PE  Fetches,   One  overlapped 
Addition,   PE  Store 

21 

2 

$C2 : =$C2+$D2  j$Cl : =$C1+$D2 

CU  Addition  overlapped 

0 

2 

ODD3:=P[$X](2) 

PE  Fetch  overlapped,   PE  Store 

7 

2 

P[$X](2):=$S-P[$X](1) 

PE  Fetch,  Addition,   PE  Store 

21 

2 

P[$X](l):=$S+P[$X](l) 

PE  Fetch  overlapped,  Addition,   PE 
Store 

Ik 

2 

$C2:=$C2-$D2 

CU  Addition  overlapped 

0 

2 

$S:=P[$X](2) 

PE  fetch  overlapped 

0 

2 

P[$X](1):4S-$R 

Addition,  PE  Store 

11+ 

2 

$B:=$R 

PE  Register  Transfer  overlapped 

0 

2 

$R:=ODD3 

PE  Fetch  overlapped 

0 

M.i      nr 

2 

P[$X](2):=$S+$B 

Addition,  PE  Store 

Ik 

$C2 : =$D4+$D1 ; $C3 : =$D4+$D5 

CU  Addition  overlapped 

0 

$R:=P[$X](2) 

PE  Fetch  overlapped 

0 

P[$X](2):=P[$X](3) 

PE  Fetch,   PE  Store 

1U 

*KI 

P[$X](3):=$R+$R 

Addition,  PE  Store 

Ik 

SUBTOTAL 

(57M-155)f^1 

Table  9-   Preparation  of  the  Data  for  the  Fast  Fourier  Transform* 


"■The  logic  for  the  code  to  keep  current  values  in  the  ADB's  defined  in 
the  initial  condition's  is  supplied  in  GLYPNIR,  Appendix  B. 
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INITIAL  CONDITIONS 

REGISTER    CONTENTS    REGISTER 

CONTENTS     REGISTER    CONTENTS 

REGISTER 

CONTENTS 

$X 

PI        $D2 

K          $D5          ILJ 

$D8 

12 

$DO 

J         $D3 

KP          $D6          IP 

$D9 

CJ 

$D1 

L         $Dk 

KM          $DT          Jl 

$D10 

IT 

LOOP 

t 

i/VSSIGNMENT  STATEMENTS 
j 

| OPERATIONS  AND  OVERLAP 

TIME  IN 
CLOCKS 

TOTAL  TIME 
FOR  LOOP 

$C2;=$D5 

CU  Register  transfer  overlapped 

0 

0 

3 

$C3:=$C2+$D1 

!CU  Addition  overlapped 

0 

3 

$S:=P[$X](2) 

'PE  Fetch  overlapped 

0 

3 

P[$X](2):=$S+P[$X](3) 

:PE  Fetch,  Addition,  PE  Store 

21 

3 

P[$X](3):=$S-P[$X](3) 

jPE  Fetch  overlapped,  Addition,  PE 

3 

$C2:=$C2+$D2 

[  Store 

|CU  Addition  overlapped 

1 

lit 

0 

i 

3^)&1  ; 

$CO:=$D3 

j 

;CU  Register  Transfer  overlapped 

0 

10(^) 

$C2 ; =GRABONE ( INDEX [ GL ] , IL ) ; 

CU  Load 

10 

1* 

$R:=P(0) 

PE  Fetch 

7 

i 

1+ 

P(0):=P(2) 

jPE  Fetch,  PE  Store 

lit 

3 

I* 

P(2):=$R 

'PE  Store 

7 

t 

1+ 

$CO ; =$CO+l ; $C2 ; =$C2+1 

CU  Addition  overlapped 

0 

28'¥»ra 

$D11 : =GRABONE ( DS [ GL ] , N2-N11 ' 

| 

*$D8 

CU  Load;  Multiplication 

19 

$D12:=GRAB0NE(DS[GL],Nll) 

CU  Load 

L_______---________ 

10 

29(M-1) 

5 

$C2:=$Di++$D7+$D5 

r 

CU  Addition  overlapped 

0 

5 

$C3 : =$C2+$D9+$D9 ; $C1 : = 

$C3+$C2 

CU  Addition  overlapped 

0 

5 

$R:=P[$X](3)*$D11 

PE  Fetch  overlapped,  Multiplication 

9 

5 

$S:=P[$X](1) 

PE  Fetch  overlapped 

0 

5 

$B:=$S*$D12 

Multiplication 

9 

5 

$R : =$R-$B     #ODDl 

Addition 

7 

5 

$S:=$S*$D11 

Multiplication 

9 

5 

$A:=P[$X](3)*$D12 

PE  Fetch  overlapped,  Multiplication 

9 

5 

$S : =$A+$S     #0DD2 

Addition 

7 

5 

P[$X](3):=P[$X](2)-$R 

PE  Fetch  overlapped,  Addition,  PE 
Store 

lit 

5 

P[$X](2):=P[$X](2)+$R 

PE  Fetch  overlapped,  Addition,  PE 
Store 

11+ 

5 

$C2:=$C2+$D2 

CU  Addition  overlapped 

0 

5 

P[$X](1):=P[$X](2)-$S 

PE  Fetch  overlapped,  Addition,  PE 

5 

P[$X](2):=P[$X](2)+$S 

PE  Fetch  overlapped,  Addition,  PE 

lit 

Store 

lit 

106(Log2(M)-2) 

- 

«™ 

SUBTOTAL       (^|(M+l)Log  (M)+51 

«-•"»$!♦ 

3UM-5U 

Table  10.  The  Complex  Fast  Fourier  Transform* 


The  logic  for  the  code  to  keep  current  values  in  the  ADB's  defined  in  the 
initial  conditions  is  supplied  in  GLYPNIR,  Appendix  B. 
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INITIAL  CONDITIONS 

REGISTER    CONTENTS    REGISTER 

CONTENTS    REGISTER 

CONTENTS 

REGISTER 

CONTENTS 

$X           PI         $DO 

J          $D1 

I 

$D2 

ILJ 

LOOP   ASSIGNMENT  STATEMENT 

OPERATIONS  AND  OVERLAP 

TIME  IN 
CLOCKS 

TOTAL  TIME 
FOR  LOOP 

$D3 : =GRABONE( IS [ GL ] , 11 ) 

CU  Load 

10 

io(M|i) 

6 

$C1:=$D2+$D1;$C2:= 

$D0-$D1+$D2 

CU  Addition  overlapped 

6 

$R:=P[$X](1)  . 

PE  Fetch  overlapped 

6 

$B:=P[$X](2) 

PE  Fetch 

6 

$R:=$R=$B 

Addition 

6 

$A:=P[$X](1) 

PE  Fetch  overlapped 

6 

$C3:=$D3 

CU  Register  transfer  overlapped 

6 

$A:=$A+$B 

Addition 

6 

$S:=$A*$C3 

Multiplication 

6 

P[$X](l):=$R+$S 

Addition,  PE  Store 

6 

P[$X](2):=$S-$R 

Addition,  PE  Store 

SUBTOTAL 


0 
0 
7 
7 
0 
0 
7 
9 
lit 
Ik 


58<^>[£l 


29MlaT2?l6Fr 5*5 


Table  11.    Final  Step  in  Cooley's  Fourier  Analysis  and  Synthesis* 


*The  logic  for  the  code  to  keep  current  values  in  the  ADB's  defined  in 
the  initial  conditions  is  supplied  in  GLYPNIR,  Appendix  B. 
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INITIAL  CONDITIONS 

iREGISTER 

CONTENTS 

REGISTER 

CONTENTS 

REGISTER 

CONTENTS 

REGISTER    CONTENTS 

$X 
'  $DO 

PI 
CI 

$D1 
$D2 

IL 
L2 

$D3 
$Dl+ 

L3 
K 

$D5         Lk 
$D6        KM1 

'loop   ASSIGNMENT  STATEMENTS 

OPERATIONS  AND  OVERLAP               ; TIME  IN   ' TOTAL  TIME 
1 CLOCKS     FOR  LOOP 

7    $C2:=$Dl;$C3:=$DO          CU  Register  Transfer  overlapped 
!  7    P(2):=RTR($C3,,P(2));       PE  Fetch,  Route,  PE  Store 

£            |20(M-1)^ 

-            __ 

$C1 :-0;$C3:+$D0            CU  Register  clear  overlapped 

$S:=1.0/A(3)               PE  Fetch,  division 

$B:=P[$X]*$S               PE  Fetch  overlapped,  Multiplication 

--------- 

63         r  -, 

$C1:=$C1+1 

$A:=RTR($C1,,AI 

$S:=RTR(l,,$S) 

$X;=RTR(l,,$X) 

$R : =RTR ( 1 , , $B ) 

$A:=$A-$S 

$S:=1.0/$A 

$A:=P[$X]-$R 

$B;=$A*$S 


%E 


CU  Addition  overlapped 

PE  Fetch,  overlapped,  Route 

Route 

Route 

Route 

Addition 

Division 

PE  Fetch  overlapped,  Addition 

Multiplication 


6 
3 
3 
3 
7 

56 
7 
9 


9U(k-i: 


$A:=RTR($C1,,A(3)) 

$R:+$B 

$A:=$A*$B 

$C2:=$D6 

$S;=P[$X](2)-$A 

P[$X](2):=$R 


PE  Fetch  overlapped,  Route 

PE  Register  transfer  overlapped 

Multiplication 

CU  Register  transfer  overlapped 

PE  Fetch  overlapped,  Addition 

PE  Store  overlapped 


22&1  : 


9  $C2:=$D2+$D3;$C1:=$C1-1 

9  $B:=RTL(1,,$R)     #Q 

9  $S:=RTL(l,,$S)     %E 

9  $X:=RTL(1,,$X)      %TI 

9  $A;=RTR($C1,,A(3)) 

9  $R : =$B     %q 

9  $A:=$A*$S 

9  $A;=P[$X](2)-$A 

9  $A;=$A=$R     %T 

9  P[$X](2):=$S 
9   |  $B:=$S     %q 

9  $S:=$A      %E 


'CU  Addition  overlapped 

Route 

Route 

Route 

PE  Fetch,  partially  overlapped 

Route 

PE  Register  transfer  overlapped 

Multiplication 

PE  Fetch  overlapped,  Addition 

Addition 

PE  Store  overlapped 

PE  Register  transfer 

PE  Register  transfer 


0 
3 
3 
3 

10 
0 
9 
7 
7 
0 
1 
1 


U6(M-i) 


sl 


#C2:=$D1;$C3:=$D0 
P(2):=RTL($C3,,P(2)) 


CU  Register  transfer  overlapped 
PE  Fetch,  Route,  PE  Store 


20(M-1) 


SUBTOTAL 


l^QN  \^]-h6  f^j 


+  i+0(M_i) 


Table  12.   CRED* 


^-The  logic  for  the  code  to  keep  current  values  in  the  ADB's  defined  in 
the  initial  conditions  is  supplied  in  GLYPNIR,  Appendix  B. 
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the  boundary   conditions  "by  -^r  .      Then  apply   (1+8.1),    (1+8.2),   the  complex 
fast   Fourier  transform  and   (1+9)  before. and  after  CRED.      This   algorithm, 
MFACR,    is   a  modified  FACR  which  solves    (39)    for  Dirichlet's  boundary 
conditions  without   using  odd/ even  reduction. 

3.5      Implementation  of  MFACR 

We  will   discuss  MFACR  in   explaining  how  it   can  be  programmed  in 
ASK  and  how  much  time  such  a  routine  would  take.     Appendix  F  contains 
GLYPNIR   code  of  the  algorithm.     Appendix  A  contains   information  on 
timing  methodology  and  explains  the  notation  used  in  Tables   9  through 
12.      Both  V  and  U  are  M+2  by  N+2  mesh.        Y  and  U  are  stored  in  straight 
storage  in  C   and  P,    respectively.      MFACR   consists  of  five   steps: 

1.  Set   up  storage  area  P 

2.  Fourier  analysis  uses   ¥  to   calculate  Y 

3.  CRED  uses   YS   to  calculate  US 

1+.      Fourier  synthesis   uses  U     to   calculate  U 

5.      Restore  boundary   conditions   and  C. 
Step  one  places   a  linear  combination  of  the  boundary  conditions   and  ¥ 
in  the  interior  of  the   storage  area  P.      In  the  process  the  first   row 
of  U  along  with  the  second  and  N+lst   row  of  ¥  are  lost.      Thus  they  are 
saved  in  temporary  storage.      This   section  of  the  algorithm  takes 
(23M+102)  [ ^-~)    +  13M  clocks    in  ASK.      Section  1  of  Appendix  F  contains 
the  GLYPNIR   code. 

Fourier  synthesis   and  analysis   are  performed  on  up  to  61+  columns 
simultaneously  described  above.      They  are  presented  in  three  parts. 
First   equations    (1+8.1)   and   (1+8.2)  prepare  the  data  for  the  complex 
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* 

fast   Fourier  transform.      Table  9      contains    suggestions   on  how  these  equa- 


as 


tions  could  be  programmed  in  ASK  and  supplies  (57M-155)  FT  clocks 
a  time  estimate  for  this  step.  Table  10  contains  suggestions  on  how  the 
complex  fast  Fourier  transform  could  be  programmed  in  ASK.  A  time  esti- 
mate of  (^  (Mfl)  Log2(M+l)  +  5  M-79)  Ig^l  +  3UM  -  5^  is  given  for  this 
step.      Finally  the   implementation  of  equation    (i+9 )    in  ASK  is    discussed 

in   Table  11  where   a  time   estimate  of  29(M+l)    \tt\     +   5M  +   5   is  presented. 

53 
The  total   time  estimate   for  Cooley's   algorithm  is    {^r-  (Mfl)    Logp(M+l) 

+  91M  -  205}    tr-1   +  39M  -  1+9   clocks.      The   GLYPNIR  code   for  Cooley's 

method  is   found  in   section  2  of  Appendix  F. 

MFACR  uses   the  modified  Thomas'   method,    equations    (1+1+) ,    (1*5),    and 
( i+T )   to  perform  CRED  on  up  to   6k   rows    simultaneously.      Table  12   suggests 
how  this  method   could  be  programmed  in  ASK  and  estimates  how  long  that 
code  would  take,    Uo(M-l)    \Jf\   +    (ll+0N-li6)    \Jf\      clocks.      Section   3  of 
Appendix  F  contains   the  GLYPNIR  code  for  CRED. 

The   final   step  of  MFACR   consists   of  restoring  the  boundary   condi- 
tions  and  the  portions   ¥  which  have  been  altered.      Section  h  of  Appendix 
F  contains   the  GLYPNIR   code   for  this   step.      If  programmed  in  ASK  this 
step  would  take  approximately  1+2    \Tr\  clocks. 

The   complete  MFACR   requires    (53/2(M-l)    Log    (M+l)    +  2*+5M-306  }    |gr-| 
+    (ll+0N-l+6)     -tt-I   +  91M   -  98    clocks.      The  timing  algorithm  is   taken  from 
Appendix  A. 


The  loop  numbers    in   Tables  9  to  12   correspond  to  the  loop  number  in 
the   GLYPNIR  code   in  Appendix  F.      They   are  supplied  to   enable  the 
reader  to   compare  loops    in  the  GLYPNIR  code  with  the   corresponding 
loops    in  the  ASK  code. 
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3.6  Implementation  of  FACR 

FACR,   one  level   of  odd/ even  reduction  would  require    (53/2(M-l)  log   (M+l) 
+  1+05M  -  k66)      k§^)   +    (80M  +  10U)    Qfl   +    (TON  -  23)    Qf)    +  81M 
-   Qk   clocks.      This    figure  assumes  the  odd  columns   are  solved  by  Thomas' 
method  with  the  e's   calculated  and  stored  in  preconditioning.      Thus 
the  odd  columns  would  take    (8kM  -  k6)     ^J    clocks.      Setting  up  the 
even   columns  would  take  T9M    U    .      Preparing  the  odd  rows   for  solu- 
tion would  take   55M   [^j    clocks.      CRED  would  take    (70N-23)    |^| 
clocks.      Fourier  analysis   and  synthesis  would  take    (53/2(M-l)   log2    (M+l) 
+  192M  -  420)  |t§k       +  68M  -  U8.      The   extra  storage  manipulation 

would  take  1TM    frr-"]   clocks.      The  rest   of  the  algorithm  is   identical  to 
MFACR. 

3.7  Summary  of  the  Results  on  Direct   Methods 

In  programming  an   algorithm  on   ILLIAC   IV,    an  attempt    is  made  to 
adapt  the  restrictions   of  the   algorithm  to  maximize  storage  and  machine 
efficiency.      For  efficient   storage  on  ILLIAC   IV,    rows   stored  across  PEM 
should  contain  a  multiple  of  6k  words.      To  maximize  machine   efficiency 
the  number  of  values  "being  computed  simultaneously  equals   6k.      Hockney's 
method  performs   a  Fourier  analysis   and  synthesis   on  each   column  and 
performs   CRED  on  each  row.      Thus  to  maximize  machine   efficiency  on 
ILLIAC   IV  the  number  of  interior  rows   and  the   number  of  interior 
columns  must  be  multiples  of  6k.      Fourier  analysis   and  synthesis 
using  Cooley's  method  requires   2-1  points  per  column   for  Dirichlet's 
and  Neumann's  boundary  conditions.      If  I   >  6  then  storing  the  columns 
across   PEM  would  waste  63  words  per  column.      Thus  we  will   store  each 
column  in  one   PE.      Now  we  have   columns   restricted  to  2   +1    (2    )    for 
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Dirichlet's  or  Neumann's    (periodic)   boundary   conditions.      Rows   are 
restricted  to  multiples  of  6k  to  maximize  storage  efficiency.      For 
periodic  "boundary   conditions    cyclic  reduction   saves   time  and  temporary 
storage  over  Thomas'   method   [Hockney,    Page  150],   and  restricts   the 
mesh  to   2      columns.      This   restriction  on  mesh  size  still   allows  maxi- 
mum storage  and  machine   efficiency.      Cyclic   reduction  restricts   the  mesh 
to  2     +1   for  Dirichlet's   and  Neumann's  boundary  conditions        This 
restriction  reduces   storage  efficiency  for  both  boundary   conditions   and 
it   reduces  machine  efficiency  for  Neumann's  boundary   conditions. 
Thomas'   method  and  the  modified  Thomas'   method  have  no  restrictions   on 
the  number  of  columns.      Thomas'   method  is   faster  than  the  modified 
Thomas'   method  but   requires  temporary  storage.      Refer  to  Table   8   for  a 
more   detailed  comparison  of  the  methods   available   for  use  by   CRED. 

Hockney 's   FACR  program  handles  the  9   different   combinations  of 
boundary  conditions   and  performs   one  level  of  odd/even   reduction   in   all 
cases.      To  maximize   efficiency  of  such   a  program  on   ILLIAC   IV  different 
algorithms  would  be  required  for  different  mesh  sizes   and  combinations 
of  boundary  conditions. 

As   stated  earlier,    odd/ even  reduction  saves    computer  time  on 
ILLIAC   IV  for   certain  mesh  sizes  but   not    for  others.      Thus   FACR  on 
ILLIAC   IV  should  have  two   algorithms,   one  which  uses  odd/even  reduction 
and  one  that  does  not.    Furthermore,   the   choice  of  a  method  to   solve 
CRED  depends  on  the  storage  requirements   of  the   individual  problem. 
It    is   the  author's   opinion  that  modified  Thomas'   Method  is  the  best 
compromise  between  storage   and  speed  for  Neumann's   and  Dirichlet's 
boundary   conditions,    and  cyclic  reduction   is  the  best   for  periodic 
boundary  conditions. 
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If  Fourier  analysis    and  synthesis    is   applied  to   columns  which  have 
Neumann's  boundary   conditions  then  there  must  be  2     +1   interior  rows 


to  be   evaluated  by  CRED.      This  means   CRED  must  be  executed    |   >.     J 


f2  +i1  2 

times.      Note    \—7\ — |      -  tt  +1    if  I   >   6   and  thus    for  one   execution  of 


CRED,    ILLIAC    IV  is  working  at   1/6U        of  its   capacity.      The  user  can 
avoid  this  machine   inefficiency  by  setting  up  the  program  so   that 
Fourier  analysis   and  synthesis   is   performed  in  the   direction  which  has 
non-Neumann  boundary   conditions. 

On  a  M  by  N  mesh  the  Fourier  part  of  FACR  requires  the  order  of 
NMLogpM  clocks  when  it   is   applied  to  the   N  columns    in  the  mesh.      CRED 
requires  the  order  of  NM  clocks.      By  letting  M  be  the   smaller  of  the 
dimensions  the  user   saves   computer  time. 
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k.      USE  OF   ILLIAC   DISK  FOR  NON-CORE  CONTAINED  MESHES 

When  the  problem  being  run  becomes  too  large  to   contain   in   core, 
the   I/O  time  becomes   an   important   consideration.      First  the   I/O   system 
for  ILLIAC  IV  "will  be  discussed  and  then  its   efficiency    for  a  specific 
problem  will  be  examined. 

k.l      ILLIAC   IV  I/O   System 

The  ILLIAC   IV  Disk  is   a  15,600  K  word  memory.      This  memory  is 
divided  into   52  bands.      Each  band  is    divided  into   300  pages,    each   con- 
taining 16  lines   of  PEM  (l  K  words).      Memory  transfers  between  the 
ILLIAC   IV  Disk  and  PEM  are  restricted  to  pages.     A  data  request    can 
read  or  write  up  to  128  consecutive  pages   on  one  band.      If  the  data 
is   not    consecutive  or  if  it   is   spread  over  more  than  one  band  a 
separate  request   is  needed  for  each   string  of  consecutive  pages.      The 
time  required  for  the   disk  to  prepare  to  perform  a  data  request   is 
equivalent  to  the  time  to  transfer  two  pages  of  data  between  the  Disk 
and  PEM.      Thus   if  a  data  request   reads  or  writes   on  page  i   and  band  j, 
the  next   data  request   should  skip  at   least  two  pages  and  start   at  page 
i  +   3  of  band  k,   where  j   need  not   equal  k.      If  the  second  data  request 
wanted  page   i  +  1,    ori  +  2,   the  request  would  have  to  wait   for  a  revolu- 
tion of  the  disk  before  it   could  be  executed.      The  transfer  rate  between 
PEM  and  Disk  is  133  usee  per  page  and  the  disk  revolves  once  every  U0 
msec.      One  revolution  of  the   disk  is   equivalent  to   6^0,000  ILLIAC  IV  clocks. 

k.2     1/0   for  the  Bernard-Rayleigh   convection  problem 

Now  let   us    examine  the  Bernard-Rayleigh  convection  problem  where  the 

following  equations   are  used  to  solve   for  temperature,   T,   pressure,   P, 
and  velocity   components  u  and  w  along  with  the  X  and  Z  axes: 
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3T  9T  j         3T  1      „2m  ... 

-rr+u-r-  +  w^-=—  VT  (50) 

3t  3x  3z  P  w    ' 

r 

dr\  3n  3n         Ra  3T     _2  ,      , 

r 

3w        3u       „2 .  .      . 

n=^-^=V^  (52) 

where  t    is   time,    R     is   the  Rayleigh  number,    P     is  the  Prandtl  number,    and 

a  r 

^   is  the   stream  function   defined  by 

3iJj  3ip 

u  =   -  r-1-  ,        w  =  — 

3z  3x 

The   finite  difference   schemes   employed  to   solve  these   equations   require 
five  meshes:      T(l),    T^"^,    r/0,    n{x'l]    and  *(t). 

Each  time  step  of  the   convection  process  has  three  main  parts; 
calculate   r\  ,    calculate  T  and  calculate   ¥ 

ri-h30    is   calculated  using  ^  [?  M+1,    ^ ,   1^]    .,    and  ?!Tj    .    v    T^ 

X>J  'X±L>3      'i,j+i'     <i,  j     '     i+l>3'  1+1,3+1       i,j 

is   calculated  using  T:T;     .,   T  T     .,   T^T1'  ¥  :  Tn     .    ,,  Y.n     .    andY.     .    ,. 

1+1,3'      1,3+1       i,J  1+1,3+1'      i+l,  J  1,3+1 

Using  all  of  n  ,^.+1     is   calculated. 

1  i,3 


The  process  will  be   divided  into  two  parts.      First  the  prognostic 

■  ain  n 
,(rU) 


equations    (50)    and   (5l)    are   solved  to  obtain  r\  and  T  .      Then 


Poisson's   equation,    (52),    is   solved  for  T 

The  prognostic   equations   require   five  meshes  while  Poisson's 
equation   requires  two  meshes    if  solved  by   an  iterative  method  and  one 
mesh   if  solved  by   a  direct  method.      Thus  non-core  contained  problems 
can  be   divided     into  two  types.      In  the   first   type,   the  meshes   required 
for  solving  Poisson's   equation   are   core   contained  but    some  of  the  meshes 
required  for  solving  the  prognostic   equations   are  not    contained  in   core. 
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In  the   second  case,   the  meshes   are  so  large  that   none  of  the  meshes    can 
be   core   contained. 

In  [Ogura,  et  al]  the  first  type  of  problem  was  studied  in  "which  SOR 
was  used  to  solve  Poisson's  equation.  If  a  direct  method  had  been  used  in 
that    study,   the  maximum  mesh   size  could  be  doubled. 

In  this    study  the   second  type  of  I/O  problem  will  be   examined  "where 
MFACR  is   used  to   solve  Poisson's   equation.      For  an  example  "we  will  use 
512  by   512  meshes.      P.     .  where  1   <   i   <   6k   and  0   <   j    £   3  is    a  page   of  a 
mesh   containing  all  points   a,     .  where   8.    -   8  <  k    <  8.    and    128  j   <  £  <  128  j  +  128. 

If  the  prognostic   equations   are   solved  separately  using  ASK  code  they 
are  found  to  be  about   50%  I/O  bound.      To  decrease  the  I/O  bound  the  Fourier 
analysis  of  MFACR  will  be  performed  along  with  the   calculation  of  the  prog- 
nostic  equations. 

This  will  be   followed  by  CRED,  with  the   final   step  being  Fourier 
synthesis . 

The  data  will  be  stored  in  blocks  22  pages  long.      Block  number  I,   B 
contains  P      .  "where   0   <   j    <   3  of  n        ,    1 '        ,    n  ,   T(t~      ,    and  rT'    for 

time  step   t.      Also  included  in  B     are  the  two  pages   required  for  the  read 
or  write  interpret.    Thirteen  blocks   can  be  placed  on  one  band  with  lU  pages 
remaining.      The   data  will  be   spread  across  the  11  bands  of  disk  with  22 
pages   separating  each  block  to   leave  room  for  the   information   from  PEM  to 
be  written.      Seven  blocks   of  data  and  six  blocks   for  writing  are  found  on 
bands   0,   2,   ht   6,    and  8.      Six  blocks  of  data  and  seven  blocks   for  writing 
are  found  on  bands  1,    3,    5,    and  rJ.      Band  9   contains   5  blocks  of  data  and 
seven  blocks   for  writing.      Band  10   contains  the  last  block  for  writing. 
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The  "blocks  are  skewed  to  enable  efficient  access  of  rows  and  columns. 
The  first  block  on  band  I  starts  at  page  -2 1  mod  300. 

The  prognostic  equations  and  the  Fourier  analysis  are  performed  while 
the  disk  revolves  ten  times  and  110  pages  are  passed  on  the  eleventh  revolu- 
tion.  Figure  6  shows  what  data  will  be  printed  on  the  first  kh   pages  of 
each  band. 

Only  one  column  of  pages  can  fit  in  PEM  at  one  time.   Since  CRED 
requires  an  entire  column  before  the  calculations  can  be  performed,  a 
column  is  read  into  PEM,  the  calculations  are  performed  and  the  column  is 
printed  on  disk.   The  data  has  been  skewed  in  such  a  way  as  to  enable  an 
entire  column  to  be  read  or  written  in  one  revolution,  see  Figure  6. 
There  are  four  columns  and  the  calculation  of  each  takes  l/k   of  a  revolu- 
tion.  The  total  I/O  time  for  CRED  is  nine  revolutions. 

The  final  step,  Fourier  Synthesis,  is  performed  in  four  revolutions. 
Figure  6  shows  that  this  can  be  accomplished  on  the  first  hk   pages  by  cal- 
culating rows  5 ?  31  and  57  on  the  first  revolution,  rows  18  and  U4  on  the 
second  revolution,  rows  12,  38  >  6k   on  the  third  revolution  and  rows  25,  51 
on  the  last  revolution.   To  enable  ^T+  '   to  overwrite  >jAT+1/^  Row  I  of 
yU+  /3)    ig  written  3  data  blocks  before  Row  I  of  Y ^T+1/3)  during  CRED.* 

The  entire  time  step  required  23  revolutions  and  110  pages  on  the  2l+th 
revolution.   This  takes  932  milliseconds.   The  calculation  could  be  per- 
formed in  333  milliseconds.   Thus  the  process  is  6h%   I/O  bound. 

A  large  portion  of  this  I/O  bound  is  created  by  the  solution  of 
Poisson's  equation.   The  last  two  steps  of  MFACR  required  13  revolutions 
of  the  disk  while  the  calculations  for  those  steps  required  the  amount 

of  time  for  2^-  revolut ions . 

*   (t+1/3) 

t        contains  the  harmonics  calculated  by  Fourier  analysis. 

(t+2/3) 

¥  contains  the  harmonics    calculated  by  CRED. 
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To  minimize  total   computer  time     a  problem  takes,  a  programmer  must 
adapt   algorithms  to    disk  maps  which   give  minimum  I/O  time.      With   I/O 
being  such   a  dominant    factor  in  the   above  problem  the  use  of  library 
subroutines  becomes   minimal  because  programs  need  to  be  customized  to 
minimize   I/O. 


Figure  6.   A  Disk  Map  of  512  by  512  Meshes 
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5.       CONCLUSIONS 

Table   13   compares   MFACR   and  FACR  with   SOR   and  ADI    for   Dirichlet's 
boundary    conditions.      The   direct   methods   use  modified  Thomas'   method  to 
perform  CRED   and  Cooley's   methods   to  perform  the  Fourier  part    of  the 
algorithm.      Unless    an    excellent    initial    guess    is    supplied  for  the   itera- 
tive methods,    the   direct   methods    are  much    faster. 

For  maximum  machine   efficiency,    MFACR  requires   64k  by   2   +1 
meshes,    FACR  requires   128k  by  2   +1  meshes,    ADI  requires   64k  by   641 
meshes,    and  SOR   requires    64k  by   j    meshes.         As    seen   in   section   2,    SOR 
can  partition   its  mesh  to  obtain  maximum  machine   efficiency.      If  the  mesh 
for  one  of  the  other  methods   did  not  meet   the   row  restrictions,    the  mesh 
could  not   be  partitioned  to  meet   the  restrictions.      A  possible   solution 
would  be  to   solve  a  number  of  meshes   at   the   same  time. 


[MESH  SIZE 
!M+2  by  N+2 

FACR 

MFACR 

SOR  FEE 
ITERATION 

ADI  PER 
ITERATION 

33  by  63 

6.51  or 
25300 

5.71  or 
22400 

I  or 
3900 

3.81  or 
14900 

65  by  63 

5.31  or 
49500 

4.91  or 
39100 

I  or 
7900 

2.51  or 

20100 

65  by  12? 

3.71  or 
59500 

4.71  or 
74000 

I  or 
15900 

2.61 
40700 

129  by  127 

3. 81  or 
122000 

4.91  or 
156000 

I  or 
32000 

2.71  or 
87700 

Table   13.    A  Comparison  of  Methods  with  Respect   to  Time' 


Two   numbers    are   supplied.      The  number  times    I   is  the  number  of 
iterations   which   could  be   performed  in   an  equivalent   amount   of  time. 
The  other  number   Is   the   number  of  clocks   required  on   ILLIAC   IV.      The 
times   given   for  SOR   in   section  2.3   don't   take   into   account   the  time 
required  to   check  for   convergence.      To   calculate  the  times    for  SOR 
in   Table  13,20%  has  been  added  to  the  times   in  section  2.3  to    account 
for   convergence   checking. 
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Direct  methods  require  only  one  mesh  in  core.   The  mesh  initially  con- 
tains the  interior  source  points  and  the  "boundary  conditions.   Throughout 
the  algorithm,  the  mesh  is  used  as  temporary  storage  with  the  values  in 
the  mesh  set  equal  to  the  solution  in  the  final  step.   Iterative  methods 
require  hoth  the  source  and  the  solution  mesh  at  every  iteration. 
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APPENDIX  A 
TIMING  METHODOLOGY 

The  procedure  for  estimating  the  amount  of  time  required  by  an 
algorithm  is  explained  below.   Care  has  been  taken  to  insure  that  the 
method  of  calculating  the  computer  time  does  not  give  one  algorithm  an 
unwarranted  advantage  over  another. 

In  determining  the  amount  of  time  an  algorithm  takes  on  a  computer 
we  must  consider  the  language  in  which  the  algorithm  is  programmed.   In 
comparing  GLYPNIR  to  ASK,  one  finds  that  the  average  GLYPNIR  code  takes 
about  twice  as  long  as  the  ASK  code  for  the  same  problem.   Two  major 
reasons  for  this  difference  are  memory  fetching  and  indexing.   The 
code  generated  by  GLYPNIR  for  indexing  can  be  improved  considerably  by 
rewriting  it  in  ASK.   GLYPNIR  does  fetching  and  storing  of  data  between 
PEM  and  PE  registers  that  can  be  eliminated  by  efficient  use  of  the  PE 
registers  in  ASK. 

GLYPNIR  code  is  much  easier  to  write  than  ASK.   One  must  choose 
between  simplicity  and  speed.   However,  GLYPNIR  permits  the  programmer 
to  write  sections  of  code  in  ASK.   Thus  the  programmer  can  write  code  in 
GLYPNIR  and  then  rewrite  the  statements  which  are  executed  most  often  in 
ASK.   In  estimating  the  time  for  the  different  methods,  we  have  timed 
the  statements  which  are  executed  in  the  inner  loops  of  the  iterative 
method.   Since  these  statements  take  most  of  the  computer  time  they 
should  be  written  in  ASK.   We  use  the  GLYPNIR  statements  as  an  outline 
of  the  method  but  the  time  estimates  reflect  efficient  ASK  code,  not  the 
code  that  GLYPNIR  would  generate.   To  show  the  reader  how  we  arrived  at 
our  time  estimates  we  have  included  a  table  with  assignment  statements, 
operations  which  are  not  overlapped  and  the  clocks  required  for  each  method, 
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The  assignment  statements  are  written  in  a  combination  of  GLYPNIR  and  ASK. 
The  parts  of  the  statements  which  are  not  used  in  GLYPNIR  are  defined  in 
Table  lk.      A  summary  of  the  assumptions  used  in  obtaining  the  time  esti- 
mates is  found  in  Table  15. 


NOTATION 

MEANING 

NOTATION 

MEANING 

$A 

PE  Register  A 

$Ci 

ACAR  i  for  i  =0,1, 2,  3 

$B 

PE  Register  B 

U(i) 

U  indexed  by  ACAR  i 

$R 

PE  Register  R 

u[$x] 

U  indexed  by  register  X 

$S 

PE  Register  S 

U[$X](i) 

both  $X  and  $Ci  indexing  of  U 

$X 

PE  Register  X 

$Di 

ADB  storage  location  i  for  i=0, . . 

-,63 

Table  1^-  Assignment  Statement  Notation 

Every  iterative  algorithm  we  consider  in  this  study  will  have  two  parts: 
Phase  1  will  be  used  to  improve  6h   points  simultaneously;   Phase  2  will 
improve  up  to  63  points  simultaneously.   The  machine  efficiency  of  Phase  1 
is  greater  than  the  machine  efficiency  of  Phase  2.   If  one  used  only  Phase  1 
on  certain  meshes  the  extra  time  needed  to  bring  the  data  to  the  PE's  would 
exceed  the  time  saved  by  improving  machine  efficiency.   Thus  the  fastest 
code  is  obtained  by  a  combination  of  Phase  1  and  Phase  2. 


1)  CU  instructions  are  overlapped  completely  by  a  combination  of  FINST/ 
PE  instructions  and  PEM  fetches. 

2)  Memory  fetching  is  overlapped  as  much  as  possible. 

3)  Seven  PE  clocks  are  used  to  load  (Store)  a  PE  register  from  (to)  PEM. 
h)     One  PE  clock  issued  to  load  a  PE  register  from  the  CU. 

5)  Transfer  between  PE  registers  require  one  clock. 

6)  Each  GRABONE  will  be  considered  to  be  an  ASK  LOAD  instruction 
(  10  clocks). 


Table  15.   Assumptions  in  Timing 


The  author  feels  these  timing  assumptions  should  give  times  within  20$ 
of  the  actual  times  the  algorithms  would  take  on  ILLIAC  IV. 
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To   aid  in   comparing  times   of  different  methods  we  will   define   an 
execution  of  Phase  1  to  be  the   improvement   of  6h  points    simultaneously  "by 
Phase  1.      Similarly,    an   execution  of  Phase  2   is  the   improvement  of  k  points 
where  Phase  2   improves   k  points   simultaneously. 
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APPENDIX  B 
MAJOR   STEPS   OF  FACR  AND  MFACR  FOR   DIRICHLET'S  BOUNDARY  CONDITIONS 

Given  the  meshes    ¥  and    U  with  mesh  points    at   locations    y.        and  U 
where  0   <   i   ^  M  and  0  ^  j    ^  N  the   five  point   difference  scheme  approxima- 
tions of  Poisson's   equation  becomes 


U 


s-l,t        2Us,t    +   us+l,t  Us,t-1   ~   2Us,t    +  Us,t+1        ,.,  ,         . 

.2  u2  s,t  KOmX) 

h  h  ' 

x  y 

To    solve    (B.l)    FACR  works    as    follows. 

Odd/even  reduction    (see  Section   3.1)    calculates    y     for  the   even   columns 
of   V.      Equation    ( Ul )    from  Section   3.1   can  "be  expanded  into  the  following 
eauation   for  individual  mesh  points. 

i 

^2  (Ui,t-2  +  Vt+2'  '  ff  <u  i.2,t  +  <W,t} 

y  x 


o 


+  k 


,h 


2  v  /     h 

y_  -  _L_  \    m  +  tt  l  -     kJt  +  A+^lu.^T^   (B.2) 


* 


-^♦r?  ("i-i^w1-  "r^r^r7 r^ 


h  h 


h  h  h 

xxy 


i,t 


\     X  x 

where 

'V*  ■  ^,t-i  -  ^  v* +  w +  2f  ^ +  4  fi.t +  v-l5-2-1' 


Fourier  analysis  calculates  the  Fourier  harmonics  ¥S  and  y*  defined  by 

M-l 

'i^ll'Vt^f.  (B.3) 

k=l    ' 


Combining    (B.2)    and    (B.3)   we   obtain 

s        2 M-1         k-  f 

!i,t=I     I      fSIN  V       "2    (Uk  t-2   +  \  t+2}    '  +  U  ) 

k=l  l^h  k,t-d         it,t  ^  4  k-2,t  k+2,ty 


M-l  h   2 


7  hx 


+  ^  +  h7](Vi^Vi,t)-^^^^^)Vt^  (b,) 
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The   Fourier  harmonics   for  U,    U     are   defined  as   follows 


M-l 

US        =|     I     SIN  ^11  (B.5) 

i,t        M      *  M        k,t 

ri — _L 


and  thus 
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U  =      Y        SIN  ^  U*   ,     .  (B.6) 


J  =      I        SIN  ^  U*   + 

k,t         /  m       i,t 

Combining   (B.5),    (B.6)    and   (B.U)   we  obtain 


ts      =  _i_  (us  +  us         )  -  I  6  V  +  A  +  A  I  us 

i,t  2        i,t+2  i,t-2J  pT  2  S    I      i,t 

y  *  X  X  y      ' 

-  m-i_        .  .   r    h  2    M-i  ,_   0xn  ,lx0« 

M  k=iL      M  V   ti      *=i  £*  *» 

X 

+ *  (if* + 75)  I(SIN  ^I.* +  SIN  ^  u*  J)] 

\    x  y     / 


(b.t: 


Using  the  facts  that  Sin(a±b)  =  Sin  a  Cos  "b  ±  Sin  b  Cos  a  and 

M-l  .,         „ 

I     Sin  ^  Sin  ^  =  5-  6.        for  t,i  =  1,   2,    ,..,   M-l  where 
L  M  M  2      1,1 

k=l 


fO  if  i 
'i,      =  [l   if  i 


7^    £ 
<$  •  :   1    1    -,■ -r  n    =    0      we  can      decouple    (B.T)    into   a  tridiagonal  linear 


svstem 

h2 


*;.*  ■  p  ^,t« +  «lt-2'  -  2  ^ cos  ^  - 8 h^ +  ^)cos 

y  '       x  *    x  x 


,     2 

hy       .      8       .     2 


^^+-T<t  (B.8) 

h  h  h       ' 

x  x  y 


CRED  solves    (B.8)    for  US, 
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Fourier  synthesis  calculates  U  from  U  using  (B.6). 


This  gives  us  the  values  for  U  on  the  even  columns. 

Using  the  values  of  U  on  the  even  columns  we  can  solve  (B.l)  for 
the  values  of  U  on  the  odd  columns  using  a  method  which  solves  tridiagonal 
linear  systems. 

In  summary  FACR  performs  the  following  algorithm: 

1.  Given  ¥  compute  ¥  for  the  even  columns  using  (B.2.1)  and  overwrite 

y  with  f  on  the  even  columns. 

*  c  *        s 

2.  Using  V     compute  T    according  to  (B-3)  and  overwrite  V     with  ¥  . 

3.  Using  ^  solve  (B.8)  for  if  and  overwrite  ^  with  if. 

k.      Using  U  compute  U  according  to  (B.6)  on  the  even  columns  and  over- 

s 
write  U  with  U  on  the  even  columns. 

5.   Using  the  values  of  U  on  the  even  columns  solve  (B.l)  for  the  values 

of  U  on  the  odd  columns  and  overwrite  H*  with  U  on  the  odd  columns. 

MFACR  is  similar  to  FACR  but  the  odd/even  reduction  step  is  omitted. 
MFACR  begins  with  Fourier  analysis  which  calculates  the  Fourier  harmonics 
T    for  ¥  using 
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U'S  ,  =  %        I         Y,    SIN  -rr 
*l,t    M  ,  ^   Tk,t       M 
k=l 


(B.9) 


From  B.  9    ve   obtain 

'U  -  2U  +  U 

-l,t s  ,t s+1  ,t 

h2 

X 


„   M-l  ,  .    /U      n 

i,t        M     *■  M\ 

'  k=l 


Us,t-1  ~   2Us,t  *  Us,tH-l\  (B.10 

h2         / 

y 


B-U 


Using  the   same  methods    employed  to  obtain    (B.8)    from    'B.7)we  obtain 
(B.H)    from   (B.10). 

,s        =      1         (us  +    us  )    +    (    2      cos  I±  2      _      2  U  (  } 

i,t        h  2  i,t+l  i,t-l  h  2  M       h  2        h  21   i,t 

y  y  y         x/ 

CRED  solves    (B.ll)    for   l/f      . 

i,t 

Finally    (B.l)    is   used  to  cal oxalate    U  for  the   entire  mesh. 
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